I. INTRODUCTION

In noncooperative game theory, one has a set of N
players, each choosing its strategy $x_i$ independently, by
sampling a distribution $q_i(x_i)$ over those strategies. Each
player i also has her own utility function $g_i(x)$, specifying
how much reward she gets for every possible joint-strategy
$x$ of all N players. Let $q_{(i)}$ mean the joint
probability distribution of all players other than i, i.e.,
$\prod_{j \neq i} q_j(x_j)$. Then the "goal" of each player i is to set $q_i$
so that, conditioned on $q_{(i)}$, the expected value of i's
utility is as high as possible.

Conventional game theory assumes each player i is
"fully rational", able to solve for that optimal $q_i$, and
that she then uses that distribution. It is primarily
concerned with analyzing the equilibria of the
game. In the real world, this assumption
of full rationality almost never holds, whether the
players are humans, animals, or computational agents.
This is due to the cost of
computation of that optimal distribution, if nothing else.
This real-world bounded rationality is one of the major
impediments to applying conventional game theory in
the real world.

This paper shows how Shannon's information theory 
provides a principled way to modify conventional
game theory to accommodate bounded rationality.
This is done by following information theory's
prescription that, given only partial knowledge concern- 
ing the distributions the players are using, we should use 
the Maximum Entropy (Maxent) principle to infer those 
distributions. Doing so results in the principle that the 
bounded rational equilibrium is the minimizer of a certain
set of coupled Lagrangian functions of the joint distribution,
$q(x) = \prod_i q_i(x_i)$. This mathematical structure
is a special instance of Product Distribution (PD) theory.

In addition to showing how to formulate bounded rationality,
PD theory provides many other advantages to
game theory. Its formulation of bounded rationality explicitly
includes a term that, in light of information theory,
is naturally interpreted as a cost of computation.
PD theory also seamlessly accommodates multiple util- 
ity functions per player. It also provides many powerful 
techniques for finding (bounded rational) equilibria, and 
helps address the issue of multiple equilibria. Another 
advantage is that by changing the coordinates of the un- 
derlying space of joint moves x, the same mathematics 
describes a type of bounded rational cooperative game 
theory, in which the moves of the players are transformed 
into contracts they all offer one another. 

Perhaps the most succinct and principled way of deriv- 
ing statistical physics is as the application of the Maxent 
principle. In this formulation, the problem of statistical 
physics is cast as how best to infer the probability dis- 
tribution over a system's states when one's prior knowl- 
edge consists purely of the expectation values of certain 
functions of the system's state. For example,
this prescription says we should infer that the probabil- 
ity distribution p governing the system is the Boltzmann 
distribution when our prior knowledge is the system's 
expected energy. This is known as the "canonical ensemble".
Other ensembles arise when other expectation
values are added to one's prior knowledge. In particular,
if the number of particles in the system is uncertain,
but one knows its expectation value, one arrives at the
"grand canonical ensemble".

One major difficulty with working with these ensem- 
bles is that under them the particles of the system are sta- 
tistically coupled with one another. For high-dimensional 
systems, this can make statistical physics calculations 
very difficult. Accordingly, a large body of work has been 
produced under the rubric of Mean Field (MF) theory, in 
which the ensemble is approximated with a distribution 
in which the particles are independent. In an MF approximation,
a product distribution q governs the joint
state of the particles — just as a product distribution 
governs the joint strategy of the players in a game. 

MF approximations are usually derived in an ad hoc 
manner. The principled way to derive an MF approximation
(or any other kind) to a particular ensemble is to
specify a distance measure saying how close two prob- 
ability distributions are, and then solve for the q that 
is closest to the distribution being approximated, p. To 
do this one needs to specify the distance measure. How 
best to measure distances between probability distributions
is a topic of ongoing controversy and research [26].
The most common way to do so is with the infinite limit
log likelihood of data being generated by one distribution
but misattributed to have come from the other. This is
known as the Kullback-Leibler (KL) distance.
It is far from being a metric. In particular, it is not sym- 
metric under interchange of the two distributions being 
compared. 
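
The asymmetry is easy to check numerically. The following is a minimal sketch, assuming the standard definition $\sum_y a(y) \ln[a(y)/b(y)]$ for discrete distributions; the function name `kl` and the two example distributions are hypothetical, and no claim is made here about which argument order the text calls "from p to q".

```python
import math

def kl(a, b):
    """Return sum_y a(y) * ln[a(y) / b(y)]; terms with a(y) == 0 contribute 0."""
    return sum(ay * math.log(ay / by) for ay, by in zip(a, b) if ay > 0)

# Two hypothetical distributions over a binary variable.
p = [0.5, 0.5]
q = [0.9, 0.1]

kl_pq = kl(p, q)  # log-ratio weighted by p
kl_qp = kl(q, p)  # log-ratio weighted by q

print(f"kl(p, q) = {kl_pq:.4f}, kl(q, p) = {kl_qp:.4f}")
```

Both values are positive but they differ, which is why the choice of direction matters when picking the product distribution closest to a given ensemble.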

It turns out that the simplest MF theories minimize the 
KL distance from q to p. However, it can be argued that it is
the KL distance from p to q that is the most appropriate
measure, not the KL distance from q to p. Using that
distance, the optimal q is a new kind of approximation
not usually considered in statistical physics.

For the canonical ensemble, the type of KL distance 
arising in simple MF theories turns out to be identical 
to the maxent Lagrangian arising in bounded rational 
game theory. This shows how bounded rational (inde- 
pendent) players are formally identical to the particles in 
the MF approximation to the canonical ensemble. Un- 
der this identification, the moves of the players play the 
roles of the states of the particles, and particle energies 
are translated into player utilities. The coordinate transformations
which in game theory result in cooperative
games are, in statistical physics, techniques for allowing
the canonical ensemble to be more accurately approximated
with a product distribution.

This identification raises the potential of transferring 
some of the powerful mathematical techniques that have 
been developed in the statistical physics community to 
the analysis of noncooperative game theory. It also suggests
translating some of the other ensembles of statistical
physics to game theory, in addition to the canonical
ensemble. As an example, in the grand canonical ensem- 
ble the number of particles is variable, which, after a MF 
approximation, corresponds to having a variable number 
of players in game theory. Among other applications, this 
provides us with a new framework for analyzing games in 
evolutionary scenarios, different from evolutionary game 
theory. 

In the next section noncooperative game theory and in- 
formation theory are cursorily reviewed. Then bounded 
rational game theory is derived, and its many advantages 
are discussed. The following section starts with a cursory 
review of the information-theoretic derivation of statisti- 
cal physics. After that is a discussion of the two kinds of 
KL distance and the MF theories they induce, and a dis- 
cussion of coordinate systems. This section also includes 
a discussion on translating a MF version of the grand 
canonical ensemble into a new kind of evolutionary game 
theory. 

As discussed in the physics section, the maxent Lagrangian
and associated Boltzmann solution at the core
of this paper have been investigated for an extremely long
time in the context of many-particle systems. Considered
in the context of a 2-player game with nature, the Boltzmann
solution has also been studied for many years in
the reinforcement learning community [28, 29]. Related
work has considered it in the context of "mechanism design"
of many players, i.e., in the context of designing the
utility functions of the players to induce them to maximize
social welfare [30, 31, 32, 33].

It turns out that independent of the work reported in 
this paper, the maxent Lagrangian and its Boltzmann solution
have been mooted in the context of game theory.
However, its motivation in that work is somewhat
ad hoc. In that work there is no use of information
theory nor discussion of the relation between bounded 
rational game theory and mean field theory. There is 
also no relation of the maxent Lagrangian to the cost of 
computation, multiple cost functions, rationality opera- 
tors, or the kinds of alternatives to evolutionary game 
theory discussed below. Nor is there discussion of semi- 
coordinate transformations and their relation to cooper- 
ative game theory. 

Finally, it's important to note that PD theory also has
many applications in science beyond those considered in
this paper. For example, there is work relating the maxent
Lagrangian to distributed control and to distributed
optimization, as well as work on algorithms for speeding
up convergence to bounded rational equilibria. Some of
those algorithms are related to simulated and
deterministic annealing [27]. In [23], others of those
algorithms are related to Stackelberg games,
and more generally to the problem of finding the optimal
control hierarchy for a team of players with a common
goal, i.e., finding an optimal organization chart. See also
[39, 40, 41] for work showing, respectively, how to use PD
theory to improve Metropolis-Hastings sampling, how to
relate it to the mechanism design work in [30, 31, 32, 33],
and how to extend it to continuous move spaces and
time-extended strategies.



II. PD THEORY AS BOUNDED RATIONAL 
NONCOOPERATIVE GAME THEORY 

This section motivates PD theory as a way of address- 
ing several of the shortcomings of conventional noncoop- 
erative game theory. 



A. Review of noncooperative game theory 

In noncooperative game theory one has a set of N 
players. Each player i has its own set of allowed pure 
strategies. A mixed strategy is a distribution $q_i(x_i)$
over player i's possible pure strategies. Each player i also
has a utility function $g_i$ that maps the pure strategies
adopted by all N of the players into the real numbers. 
So given mixed strategies of all the players, the expected
utility of player i is $E(g_i) = \int dx \prod_j q_j(x_j)\, g_i(x)$.
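
For a finite game the integral becomes a sum over joint pure strategies. The sketch below illustrates that sum; the helper `expected_utility`, the dictionary encoding of the utility function, and the matching-pennies payoffs are all hypothetical choices made for the example, not anything taken from the paper.

```python
from itertools import product

def expected_utility(g, qs):
    """E(g) = sum_x prod_j q_j(x_j) g(x) for a finite game.

    g  : dict mapping each joint pure strategy (a tuple of indices) to a utility
    qs : list of mixed strategies, one list of probabilities per player
    """
    total = 0.0
    for x in product(*(range(len(q)) for q in qs)):
        prob = 1.0
        for j, xj in enumerate(x):
            prob *= qs[j][xj]  # product distribution: players are independent
        total += prob * g[x]
    return total

# Hypothetical example: matching pennies, utility of player 0.
g0 = {(0, 0): 1.0, (0, 1): -1.0, (1, 0): -1.0, (1, 1): 1.0}

e_uniform = expected_utility(g0, [[0.5, 0.5], [0.5, 0.5]])
e_pure = expected_utility(g0, [[1.0, 0.0], [1.0, 0.0]])
print(e_uniform, e_pure)
```

Pure strategies are recovered as the special case where each $q_j$ puts all its mass on one move.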

This basic framework can be elaborated to model 
many interactions between biological organisms, and in 
particular between human beings. These interactions 
range from simple abstractions like the famous prisoner's 
dilemma to iterated games like chess, to international
relations.

Much of noncooperative game theory is concerned with 
equilibrium concepts specifying what joint-strategy 
one should expect to result from a particular game. In 
particular, in a Nash equilibrium every player adopts 
the mixed strategy that maximizes its expected utility, 
given the mixed strategies of the other players. More 
formally, $\forall i$, $q_i = \arg\max_{q_i'} \int dx\, q_i'(x_i) \prod_{j \neq i} q_j(x_j)\, g_i(x)$.

Several very rich fields have benefited from a close re- 
lationship with noncooperative game theory. Particular 
examples are evolutionary game theory (in which the set 
of N players is replaced by an infinite set of reproducing
organisms) and cooperative game theory (in which
players choose which coalitions of other players to join).
Game theory as a whole is also closely related
to economics, in particular the field of mechanism design,
which is concerned with how to induce the set of
players to adopt a socially desirable joint-strategy.



B. Problems with conventional noncooperative 
game theory 

A number of objections to the Nash equilibrium con- 
cept have been resolved. In particular, it was Nash who 
proved that every game has at least one Nash equilib- 
rium if one expands the realm of discourse to include 
mixed strategies. (The same is not true for pure strate- 
gies.) Other objections have been more or less resolved 
through numerous refinements of the Nash equilibrium 
concept. 

However there are several major problems with the 
concept that are still outstanding. One of them is the 
possible multiplicity of equilibria; this multiplicity means 
the Nash equilibrium concept cannot be used to specify 
the joint strategy that is actually adopted in a real world 
game. (Some refinements of the Nash equilibrium con- 
cept attempt to address this problem, though none has 
succeeded.) Another problem is that while calculating 
Nash equilibria is straightforward in many simple games 
(e.g., 2 players in a zero-sum game), calculating them 
in the general case can be a very difficult computational 
multi-criteria optimization problem. Yet another prob- 
lem is that there is no general way to extend the concept 
to allow each player to have multiple utility functions. 

However perhaps the major problem with the Nash 
equilibrium concept is its assumption of full rationality.
This is the assumption that every player i can both
calculate what the strategies $q_{j \neq i}$ will be and then calculate
its associated optimal distribution. In other words,
it is the assumption that every player will calculate the
entire joint distribution $q(x) = \prod_j q_j(x_j)$. If for no other
reasons than computational limitations of real humans,
this assumption is essentially untenable. This problem is
just as severe if one allows statistical coupling among the
players.

A large body of empirical lore has been generated char- 
acterizing the bounded rationality of humans. Similarly 
much has been learned about the empirical behavior 
of (bounded rational) machine learning computer algorithms
playing games with one another. None of
this work has resulted in a full mathematical theory of 
bounded rationality however. 

There have also been numerous theoretical attempts 
to incorporate bounded rationality into noncooperative 
game theory by modifying the Nash equilibrium concept.
Some of them assume essentially that every player's
mixed strategy is its Nash-optimal strategy with some
form of noise superimposed. Others explicitly model
the humans, typically as computationally limited automata,
and assume the automata perform optimally
subject to those computational limitations. Both
approaches, while providing insight, are very ad hoc as 
models of games involving real-world organisms or real- 
world (i.e., non-trivial) machine learning algorithms. 

The difficulty of calculating equilibria is addressed in 
the sections below on solving for the distributions of PD 
theory. The rest of this section shows how information 
theory can be used to extend game theory to avoid its 
other shortcomings. Finally, the sections after this one 
present some other extensions of game theory, in partic- 
ular to allow for a variable number of players. (Games 
with variable number of players arise in many biological 
scenarios as well as economic ones.) 



C. Review of the maximum entropy principle 

Shannon was the first person to realize that based 
on any of several separate sets of very simple desider- 
ata, there is a unique real-valued quantification of the 
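amount of syntactic information in a distribution $P(y)$.
He showed that this amount of information is (the negative
of) the Shannon entropy of that distribution,
$S(P) = -\int dy\, P(y) \ln\left[\frac{P(y)}{\mu(y)}\right]$, where $\mu$ is an a priori measure over $y$ (taken to be uniform here).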

So for example, the distribution with minimal infor- 
mation is the one that doesn't distinguish at all between 
the various y, i.e., the uniform distribution. Conversely, 
the most informative distribution is the one that specifies 
a single possible y. Note that for a product distribution, 
entropy is additive, i.e., $S(\prod_i q_i(y_i)) = \sum_i S(q_i)$.
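
These three properties (zero entropy for a single certain outcome, maximal entropy for the uniform distribution, additivity over product distributions) can be checked directly. This is a minimal sketch for discrete distributions, assuming a uniform reference measure so that the entropy reduces to $-\sum_y P(y) \ln P(y)$; the distributions `q1` and `q2` are hypothetical.

```python
import math
from itertools import product

def entropy(p):
    """Shannon entropy -sum_y p(y) ln p(y), with 0 ln 0 taken as 0."""
    return -sum(py * math.log(py) for py in p if py > 0)

q1 = [0.2, 0.8]
q2 = [0.1, 0.3, 0.6]

# Joint product distribution over the 2 x 3 joint outcomes.
joint = [a * b for a, b in product(q1, q2)]

s_joint = entropy(joint)
s_sum = entropy(q1) + entropy(q2)
s_certain = entropy([1.0, 0.0])
s_uniform = entropy([0.5, 0.5])
print(s_joint, s_sum, s_certain, s_uniform)
```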

Say we are given some incomplete prior knowledge about a
distribution $P(y)$. How should one estimate $P(y)$ based
on that prior knowledge? Shannon's result tells us how to
do that in the most conservative way: have your estimate
of $P(y)$ contain the minimal amount of extra information
beyond that already contained in the prior knowledge
about $P(y)$. Intuitively, this can be viewed as a version
of Occam's razor. This approach is called the maximum
entropy (maxent) principle. It has proven extremely useful
in domains ranging from signal processing to image
processing to supervised learning [17].

D. Maxent Lagrangians 

Much of the work on equilibrium concepts in game the- 
ory adopts the perspective of an external observer of a 
game. We are told something concerning the game, e.g., 
its utility functions, information sets, etc., and from that 
wish to predict what joint strategy will be followed by 
real-world players of the game. Say that in addition to 
such information, we are told the expected utilities of 
the players. What is our best estimate of the distribu- 
tion q that generated those expected utility values? By 
the maxent principle, it is the distribution with maximal 
entropy, subject to those expectation values. 

To formalize this, for simplicity assume a finite num- 
ber of players and of possible strategies for each player. 
To agree with the convention in other fields, from now on 
we implicitly flip the sign of each $g_i$ so that the associated
player i wants to minimize that function rather than
maximize it. Intuitively, this flipped $g_i(x)$ is the "cost"
to player i when the joint-strategy is x, rather than its
utility then.

Then for prior knowledge that the expected utilities of
the players are given by the set of values $\{\epsilon_i\}$, the maxent
estimate of the associated q is given by the minimizer of
the Lagrangian

$L(q) \equiv \sum_i \beta_i [E_q(g_i) - \epsilon_i] - S(q)$

$= \sum_i \beta_i \left[\int dx \prod_j q_j(x_j)\, g_i(x) - \epsilon_i\right] - S(q)$    (1)

where the subscript on the expectation value indicates
that it is evaluated under distribution q, and the $\{\beta_i\}$ are
Lagrange parameters implicitly set by the constraints on
the expected utilities [50].

Solving, we find that the mixed strategies minimizing
the Lagrangian are related to each other via

$q_i(x_i) \propto e^{-E_{q_{(i)}}(G \mid x_i)}$    (2)

where the overall proportionality constant for each i is
set by normalization, $G \equiv \sum_i \beta_i g_i$, and the subscript
$q_{(i)}$ on the expectation value indicates that it is evaluated
according to the distribution $\prod_{j \neq i} q_j$. In Eq. (2) the probability
of player i choosing pure strategy $x_i$ depends on the
effect of that choice on the utilities of the other players.
This reflects the fact that our prior knowledge concerns
all the players equally.

If we wish to focus only on the behavior of player i,
it is appropriate to modify our prior knowledge. To see 
how to do this, first consider the case of maximal prior 
knowledge, in which we know the actual joint-strategy of 



the players, and therefore all of their expected costs. For 
this case, trivially, the maxent principle says we should 
"estimate" q as that joint-strategy (it being the q with 
maximal entropy that is consistent with our prior knowl- 
edge). The same conclusion holds if our prior knowledge 
also includes the expected cost of player i. 

Now modify this maximal set of prior knowledge by 
removing from it specification of player i's strategy. So 
our prior knowledge is the mixed strategies of all players 
other than i, together with player i's expected cost. We
can incorporate the prior knowledge of the other players' 
mixed strategies directly into our Lagrangian, without 
introducing Lagrange parameters. That maxent Lagrangian
is

$L_i(q_i) \equiv \beta_i [E(g_i) - \epsilon_i] - S_i(q_i)$

$= \beta_i \left[\int dx \prod_j q_j(x_j)\, g_i(x) - \epsilon_i\right] - S_i(q_i)$

with solution given by a set of coupled Boltzmann distributions:

$q_i(x_i) \propto e^{-\beta_i E_{q_{(i)}}(g_i \mid x_i)}$.    (3)

Following Nash, we can use Brouwer's fixed point theorem
to establish that for any non-negative values $\{\beta_i\}$,
there must exist at least one product distribution given
by the product of these Boltzmann distributions (one
term in the product for each i).

The first term in $L_i$ is minimized by a perfectly rational
player. The second term is minimized by a perfectly irrational
player, i.e., by a perfectly uniform mixed strategy
$q_i$. So $\beta_i$ in the maxent Lagrangian explicitly specifies the
balance between the rational and irrational behavior of
the player. In particular, for $\beta \to \infty$, by minimizing the
Lagrangians we recover the Nash equilibria of the game.
More formally, in that limit the set of q that simultaneously
minimize the Lagrangians is the same as the set of
delta functions about the Nash equilibria of the game.
The same is true for Eq. (2).
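
As an illustration of the coupled Boltzmann distributions of Eq. (3), the sketch below searches for a fixed point in a two-player game by damped iteration. The damped update is an assumption of this example, not a method taken from the text, and it is not guaranteed to converge in general. The cost matrices are a hypothetical prisoner's dilemma (move 0 = cooperate, move 1 = defect), chosen because defecting always costs exactly 1 less than cooperating, so the fixed point is known in closed form: $q_i(\text{defect}) = e^{\beta}/(1 + e^{\beta})$.

```python
import math

def boltzmann_response(cond_costs, beta):
    """q_i(x_i) proportional to exp(-beta * E(g_i | x_i))."""
    m = min(cond_costs)  # shift by the minimum for numerical stability
    w = [math.exp(-beta * (c - m)) for c in cond_costs]
    z = sum(w)
    return [wi / z for wi in w]

def conditional_costs(g, q_other, player):
    """E(g | x_i) in a two-player game, where g[x0][x1] is the cost to `player`."""
    n_own = len(g) if player == 0 else len(g[0])
    out = []
    for xi in range(n_own):
        if player == 0:
            out.append(sum(q_other[xo] * g[xi][xo] for xo in range(len(q_other))))
        else:
            out.append(sum(q_other[xo] * g[xo][xi] for xo in range(len(q_other))))
    return out

def solve(g0, g1, beta0, beta1, iters=200, damping=0.5):
    """Damped fixed-point iteration on the coupled Boltzmann distributions."""
    q0 = [1.0 / len(g0)] * len(g0)
    q1 = [1.0 / len(g0[0])] * len(g0[0])
    for _ in range(iters):
        new0 = boltzmann_response(conditional_costs(g0, q1, 0), beta0)
        new1 = boltzmann_response(conditional_costs(g1, q0, 1), beta1)
        q0 = [(1 - damping) * a + damping * b for a, b in zip(q0, new0)]
        q1 = [(1 - damping) * a + damping * b for a, b in zip(q1, new1)]
    return q0, q1

# Hypothetical prisoner's dilemma costs, indexed [x0][x1].
g0 = [[1.0, 3.0], [0.0, 2.0]]  # cost to player 0
g1 = [[1.0, 0.0], [3.0, 2.0]]  # cost to player 1

q0_soft, q1_soft = solve(g0, g1, beta0=1.0, beta1=1.0)
q0_sharp, q1_sharp = solve(g0, g1, beta0=20.0, beta1=20.0)
print(q0_soft, q0_sharp)
```

At $\beta = 1$ both players still cooperate with noticeable probability; at $\beta = 20$ the solution is numerically indistinguishable from the pure-strategy Nash equilibrium (both defect), consistent with the large-$\beta$ limit described above.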

Eq. (2) is just a special case of Eq. (3) where all players
share the same cost function G. (Such games are known
as team games.) This relationship reflects the fact
that for this case, the difference between the maxent Lagrangian
and the one in Eq. (1) is independent of $q_i$. Due
to this relationship, our guarantee of the existence of a
solution to the set of maxent Lagrangians implies the
existence of a solution of the form Eq. (2).

Typically players aren't close to perfectly self-defeating.
Almost always they will be closer to minimizing
their expected cost than maximizing it. For prior
knowledge consistent with such a case, the $\beta_i$ are all
non-negative.

Finally, our prior knowledge often will not consist of 
exact specification of the expected costs of the players, 
even if that knowledge arises from watching the players 
make their moves. Such other kinds of prior knowledge 
are addressed in several of the following subsections. 

E. Alternative interpretations of Lagrangians

There are numerous alternative interpretations of these 
results. For example, change our prior knowledge to be 
the entropy of each player i's strategy, i.e., how unsure 
it is of what move to make. Now we cannot use information
theory to make our estimate of q. Given that players
try to minimize expected cost, a reasonable alternative
is to predict that each player i's expected cost will be as
small as possible, subject to that provided value of the
entropy and the other players' strategies. The associated
Lagrangians are $\alpha_i [S(q_i) - \sigma_i] - E(g_i)$, where $\sigma_i$ is the
provided entropy value. This is equivalent to the maxent
Lagrangian, and in particular has the same solution,
Eq. (3).

Another alternative interpretation involves world 
cost functions, which are quantifications of the qual- 
ity of a joint pure strategy x from the point of view 
of an external observer (e.g., a system designer, the 
government, an auctioneer, etc.). A particular class 
of world cost functions are "social welfare functions",
which can be expressed in terms of the cost functions
of the individual players. Perhaps the simplest example
is $G(x) = \sum_i \beta_i g_i(x)$, where the $\beta_i$ serve to trade off how
much we value one player's cost vs. another's. If we know
the value of this social welfare function, but nothing else,
then maxent tells us to minimize the Lagrangian of Eq. (1).

F. Bounded rational game theory 

In many situations we have prior knowledge different 
from (or in addition to) expected values of cost functions. 
This is particularly true when the players are human be- 
ings (so that behavioral economics studies can be brought 
to bear) or simple computational algorithms. To apply 
information theory in such situations, we simply need to 
incorporate that prior knowledge into our Lagrangian(s). 

To give a simple example, say that we know that the 
players all want to ensure not just a low expected cost, 
but also that the actual cost doesn't vary too much from 
one sample of q to the next. We can formalize this by say- 
ing that in addition to expected costs, our prior knowl- 
edge includes variances in the costs. Given the expected 
values of the costs, such variances are specified by the ex- 
pected values of the squares of the cost. Accordingly, all 
our prior knowledge is in the form of expectation values. 
Modifying Eq. (3) appropriately, we arrive at the solution

$q_i(x_i) \propto e^{-\alpha_i E_{q_{(i)}}((g_i - \lambda_i)^2 \mid x_i)}$,

where the Lagrange parameters $\alpha_i$ and $\lambda_i$ are given by
the provided expectations and variances of the costs of
the players.

This distribution is our best guess for what the actual mixed
strategy of player i is, in light of our prior knowledge concerning
that player. Note that this formula directly reflects
the fact that player i does not care only about minimizing
cost, i.e., maximizing utility. In this, we are directly
incorporating the possibility that the player violates the
axioms of utility theory, something never allowed in
conventional game theory. Other behavioral economics 
phenomena like risk aversion can be treated in a similar 
fashion. 

A variant of this scenario would have our prior knowl- 
edge only give the variances of the costs of the players 
and not their expected costs. In this case the Lagrangian
must involve a term quadratic in q, in addition to the
entropy term and a term linear in q. (See the subsection
on multiple cost functions.) More generally, our prior
knowledge can be any nonlinear function of q. In addition,
even if we stick to prior knowledge that is linear in q,
that knowledge can couple the cost functions of the players.
For example, if we know that the expected difference
in cost of players i and j is $\epsilon$, the associated Lagrange
constraint term is $\int dx\, q(x)[g_i(x) - g_j(x) - \epsilon]$. In this situation
our prior knowledge couples the strategies of the
players, even though those players are independent. See 
the discussion on constrained optimization in Sec. Opt. 



G. Cost of computation 

As mentioned above, bounded rationality is an un- 
avoidable consequence of the cost of computation to 
player i of finding its optimal strategy. Unfortunately, 
one cannot simply incorporate that cost into $g_i$, and then
presume that the player acts perfectly rationally for this
new $g_i$. The reason is that this cost is associated with the
entire distribution $q_i(x_i)$ that player i calculates; it is not
associated with some particular joint-strategy formed by
sampling such a distribution.

How might we quantify the cost of calculating $q_i$? The
natural approach is to use information theory. Indeed, 
that cost arises naturally in the bounded rationality for- 
mulation of game theory presented above. To see how, 
for each player i define 

$f_i(x, q_i(x_i)) \equiv \beta_i g_i(x) + \ln[q_i(x_i)]$.

Then we can write the maxent Lagrangian for player i as

$L_i(q) = \int dx\, q(x)\, f_i(x, q_i(x_i))$.    (4)

Now in a bounded rational game every player sets its 
strategy to minimize its Lagrangian, given the strategies 
of the other players. In light of Eq. (4), this means that we
can interpret each player in a bounded rational game as 
being perfectly rational for a cost function that incorpo- 
rates its computational cost. To do so we simply need to 
expand the domain of "cost functions" to include proba- 
bility values as well as joint moves. 
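
The identity behind Eq. (4) can be verified numerically. The sketch below assumes the reading $f_i(x, q_i(x_i)) = \beta_i g_i(x) + \ln q_i(x_i)$ and a two-player game, and checks that the expectation of $f_i$ under the product distribution equals $\beta_i E(g_i) - S(q_i)$, i.e. the maxent Lagrangian with its constant term $\beta_i \epsilon_i$ dropped (that constant does not affect the minimizer). The function names and the numbers are hypothetical.

```python
import math
from itertools import product

def lagrangian_via_f(g, q_i, q_other, beta):
    """sum_x q(x) [beta * g(x) + ln q_i(x_i)], with q(x) = q_i(x_i) * q_other(x_o)."""
    total = 0.0
    for xi, xo in product(range(len(q_i)), range(len(q_other))):
        qx = q_i[xi] * q_other[xo]
        if qx > 0:
            total += qx * (beta * g[xi][xo] + math.log(q_i[xi]))
    return total

def lagrangian_direct(g, q_i, q_other, beta):
    """beta * E(g) - S(q_i), computed term by term."""
    expected_cost = sum(
        q_i[xi] * q_other[xo] * g[xi][xo]
        for xi, xo in product(range(len(q_i)), range(len(q_other)))
    )
    entropy = -sum(p * math.log(p) for p in q_i if p > 0)
    return beta * expected_cost - entropy

g = [[1.0, 3.0], [0.0, 2.0]]  # hypothetical cost to player i, indexed [x_i][x_other]
q_i = [0.3, 0.7]
q_other = [0.6, 0.4]
beta = 2.0

via_f = lagrangian_via_f(g, q_i, q_other, beta)
direct = lagrangian_direct(g, q_i, q_other, beta)
print(via_f, direct)
```

The $\ln q_i(x_i)$ term is what makes the "cost function" depend on a probability value as well as on the joint move.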

Similar results hold for non-maxent Lagrangians. All 
that's needed is that we can write such a Lagrangian in 
the form of Eq. (4) for some appropriate function $f_i$.

H. Multiple cost functions per player

Say player i has several different cost functions $\{g_i^j\}$
and wants to choose a strategy that will do well at all of
them. In the case of pure strategies we can simply define
an aggregate function like $\max_j g_i^j(x)$ or $\sum_j g_i^j(x)$ and
employ that in a conventional, single-cost-function-per-player
game theoretic analysis. Player i will perform well
according to such a function iff it performs well according
to all of the constituent $g_i^j$.

One might think that for mixed strategies one could
just "roll up" the cost functions and say that player
i works to minimize an aggregate cost function like $\sum_j g_i^j$.

However, especially when player i has many cost functions,
it may be that performance according to one or
more of the constituent cost functions is quite bad even
though the performance according to this average function
is good. Similarly, player i can have a low value
of the expectation of the minimum of its cost functions,
even though the minimum of the expected costs
is quite high. More generally, we cannot ensure that
$E_q(g_i^j) = \int dx\, g_i^j(x)\, q_i(x_i)\, q_{(i)}(x_{(i)})$ has a good value for
all j by appropriately defining an aggregate $g_i$. Instead,
we must "redefine" expected cost.

We can address this by modifying our goals, in anal- 
ogy with the goals typically ascribed to players playing 
pure strategies. We do this by having the choice of cost 
function for player i be controlled by a fictional player. 
For example, conventional game theory analyzes the case
where player i chooses a pure strategy to minimize the
worst case (over other players' moves) cost to i, i.e., to
minimize $\max_{x_{(i)}} g_i(x_i, x_{(i)})$. Here the analogy would be
for the player to choose a mixed strategy to minimize the
worst case (over moves by the fictional player) expected
cost, i.e., to minimize $\max_j E_q(g_i^j)$. A similar choice, appropriate
when the cost functions are all positive-definite,
is for player i to minimize $\sum_j [E_q(g_i^j)]^2$ [51]. Formally,
such functions are just Lagrangians of q. If we wish,
we can modify them to incorporate bounded rationality,
getting Lagrangians like $\sum_j \beta_j [E_q(g_i^j)]^2 - S(q_i)$, where
the $\beta_j$ determine the relative rationalities of player i according
to its various cost functions.
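
A small numerical sketch shows how an aggregate and the worst-case expected cost can disagree. It assumes, purely for illustration, that the other players' strategies have already been averaged out, so each of player i's two hypothetical cost functions depends only on i's own move; the mixed strategy is $q_i = (1 - t, t)$ and the minimization is a plain grid search.

```python
def expected_costs(costs, t):
    """Expected value of each cost function under q_i = (1 - t, t).

    costs[j][x_i] is the (already averaged) cost of pure move x_i under cost function j.
    """
    return [(1 - t) * c[0] + t * c[1] for c in costs]

# Two hypothetical cost functions over two pure moves.
costs = [[0.0, 4.0], [3.0, 1.0]]

def worst_case(t):
    """max_j E_q(g^j): the fictional player picks the worst cost function."""
    return max(expected_costs(costs, t))

def summed(t):
    """E_q of the rolled-up aggregate sum_j g^j."""
    return sum(expected_costs(costs, t))

grid = [k / 100 for k in range(101)]
t_sum = min(grid, key=summed)          # minimizer of the aggregate
t_minimax = min(grid, key=worst_case)  # minimizer of the worst-case expected cost

print(t_sum, worst_case(t_sum), t_minimax, worst_case(t_minimax))
```

Minimizing the aggregate selects a pure move under which one constituent cost is high, whereas the minimax criterion selects a genuinely mixed strategy that equalizes the two expected costs at a lower worst-case value.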

These kinds of Lagrangians can also model the pro- 
cess of mechanism design, where there is an external 
designer who induces the players to adopt a desirable 
joint-strategy q. As an example, "desirable" sometimes
means that no single player's expected cost is high. A
system that meets this goal fairly well can be modeled
with a Lagrangian involving terms like $\sum_i [E_q(g_i)]^2$.

I. Shape of the Lagrangian surface 

To analyze the shape of the Lagrangian, we start with 
the following lemma, which extends the technique of La- 
grange parameters to off-equilibrium points: 



Lemma 1: Consider the set of all vectors leading from
$x' \in \mathbb{R}^n$ that are, to first order, consistent with a set of
constraints $\{f_i(x) = 0\}$ over $\mathbb{R}^n$. Of those vectors, the one giving the
steepest ascent of a function $V(x)$ is $u = \nabla V + \sum_i \lambda_i \nabla f_i$,
up to an overall proportionality constant, where the $\lambda_i$
enforce the first order consistency conditions, $u \cdot \nabla f_i = 0$ for all i.
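
For a single constraint the lemma reduces to projecting the gradient onto the constraint surface. The sketch below assumes one linear constraint, the normalization $\sum_k x_k = 1$ (whose gradient is the all-ones vector), solves for the one Lagrange parameter, and compares the resulting direction against randomly drawn feasible directions. The helper names and the gradient values are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def constrained_ascent(grad_v, grad_f):
    """u = grad_v + lam * grad_f, with lam chosen so that u . grad_f = 0."""
    lam = -dot(grad_v, grad_f) / dot(grad_f, grad_f)
    return [a + lam * b for a, b in zip(grad_v, grad_f)]

grad_v = [3.0, -1.0, 0.5]  # hypothetical gradient of V at x'
grad_f = [1.0, 1.0, 1.0]   # gradient of the constraint sum_k x_k - 1 = 0

u = constrained_ascent(grad_v, grad_f)
best_rate = dot(unit(u), grad_v)  # first-order ascent rate along u

# Compare with random unit directions that also satisfy d . grad_f = 0.
rng = random.Random(0)
other_rates = []
for _ in range(1000):
    d = [rng.gauss(0.0, 1.0) for _ in grad_v]
    d = unit(constrained_ascent(d, grad_f))  # same formula acts as a projection
    other_rates.append(dot(d, grad_v))

print(u, best_rate, max(other_rates))
```

With this constraint the Lagrange parameter is just minus the mean of the gradient components, so the feasible steepest-ascent direction is the gradient with its mean subtracted.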


This lemma can be used to establish that at the edge
of Q, the space of all product distributions q, the steepest
descent direction of any player's Lagrangian points into
the interior of Q (assuming finite $\beta$ and $\{g_i\}$). Accordingly,
whereas Nash equilibria can be on the edge of Q
(e.g., for a pure strategy Nash equilibrium), in bounded
rational games any equilibrium must lie in the interior of
Q. In other words, any equilibrium (i.e., any local minimum)
of a bounded rational game has non-zero probability
for all joint moves. So we never have to consider
extremal mixed strategies in searching for equilibria. 

Lemma 1 can also be used to construct examples of 
games with more than one bounded rational equilibrium 
(just like there are games with more than one Nash equilibrium).
One can also show that for every player i and
any point q interior to Q, there are directions in Q along 
which i's Lagrangian is locally convex. Accordingly, no 
player's Lagrangian has a local maximum interior to Q. 
So if there are multiple local minima of i's Lagrangian, 
they are separated by saddle points across ridges. Similarly,
the uniform q is a solution to the set of coupled
equations Eq. (3) for a team game, but typically is not a
local minimum, and therefore must be a saddle point.

Say we modify the Lagrangians to be defined for all
possible $p$, not just those that are product distributions.
For example, the maxent Lagrangian over expected costs becomes

$L(p) = \sum_i \beta_i \left[\int dx\, g_i(x)\, p(x) - \epsilon_i\right] - S(p)$.

The first term in this Lagrangian is linear in $p$. Since entropy
is a concave function of the Euclidean vector $p$ over
the unit simplex, this means that the overall Lagrangian
is a convex function of $p$ over the space of allowed $p$. This
means there is a unique minimum of the Lagrangian over
the space of all possible legal $p$. Furthermore, as mentioned
previously, for finite $\beta$ at least one of the derivatives
of the Lagrangian is negative infinite at the border
of the allowed region of $p$. This means that the unique
minimum of the Lagrangian is interior to that region, i.e.,
is a legal probability distribution.
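As a short worked step (a sketch, not part of the original derivation), the unique minimizer can be written down explicitly. Assume the Lagrangian has the form $L(p) = \sum_i \beta_i [\int dx\, g_i(x) p(x) - \epsilon_i] - S(p)$ with uniform $\mu$, and introduce a multiplier $\lambda$ (a symbol added here) for normalization. Setting the derivative with respect to $p(x)$ to zero gives a Boltzmann distribution in the weighted sum of the cost functions:

```latex
\begin{align}
\frac{\partial}{\partial p(x)}
  \Big[ L(p) + \lambda \Big( \int dx'\, p(x') - 1 \Big) \Big]
  &= \sum_i \beta_i\, g_i(x) + \ln p(x) + 1 + \lambda = 0 \\
\Longrightarrow \quad
p(x) &\propto \exp\Big( -\sum_i \beta_i\, g_i(x) \Big).
\end{align}
```

Because $\sum_i \beta_i g_i(x)$ does not in general decompose into a sum of single-player terms, this $p$ does not in general factor across the players.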

In general this optimal p will not be a product dis- 
tribution, of course. Rather the strategy choices of the 
players are typically statistically coupled, under this p. 
Such coupling is very suggestive of various stochastic for- 
mulations of noncooperative game theory. Coupling also 
arises in cooperative game theory, in which binding con- 
tracts couple the moves of the players 0, \^ . 

Similarly, as is proven in the appendix, the Lagrangian
$L(p) = \beta \sum_i [E_p(g_i)]^2 - S(p)$ is convex over the manifold
of legal $p$, assuming non-negative $\beta$. So the model of
mechanism design introduced in Sec. II H has a unique
equilibrium, provided we allow the players to be statistically
coupled.



J. Rationality operators 

Often our prior knowledge will not concern expected 
costs. In particular, this is usually true if our prior knowl- 
edge is provided to us before the game is played, rather 
than afterward. In such a situation, prior knowledge will 
more likely concern the "intelligences" of the players, i.e., 
how close they are to being rational. In particular, if 
we want our prior knowledge concerning player i to be 
relatively independent of what the other players do, we 
cannot use i's expected cost as our prior knowledge. Our
prior knowledge will often concern how peaked i's mixed 
strategy is about whichever of its moves minimize its cost 
(or how peaked we can assume it to be), not the associ- 
ated minimal cost values. 

Formally, the problem faced by player $i$ is how to set
its mixed strategy $q_i(x_i)$ so as to minimize the expected
value of its effective cost function, $E(g_i \mid x_i)$. Generalizing,
what we want is a rationality operator $R(U, p)$ that
measures how peaked an arbitrary distribution $p(y)$ is
about the minimizers of an arbitrary cost function $U(y)$,
$\mathrm{argmin}_y U(y)$.

Formally, we make two requirements of $R$:

1. If $p(y) \propto e^{-\beta U(y)}$, for non-negative $\beta$, then it is
natural to require that the peakedness of the distribution
(its rationality value) is $\beta$.

2. We also need to specify something of $R(U, p)$'s
behavior for non-Boltzmann $p$. It will suffice to
require that of the $p$ satisfying $R(U, p) = \beta$, the
one that has maximal entropy is proportional to
$e^{-\beta U(y)}$. In other words, we require that the Boltzmann
distribution maximizes entropy subject to a
provided value of the rationality operator.

As an illustration, a natural choice for $R(U, p)$ would be
the $\beta$ of the Boltzmann distribution that "best fits" $p$.
Information theory provides us such a measure for how
well a distribution $p_1$ is fit by a distribution $p_2$. This is
the Kullback-Leibler distance [Ull^: 

$KL(p_1 \,\|\, p_2) = S(p_1 \,\|\, p_2) - S(p_1)$ (5)

where $S(p_1 \,\|\, p_2) = -\int dy\, p_1(y) \ln\left[\frac{p_2(y)}{\mu(y)}\right]$ is known as
the cross entropy from $p_1$ to $p_2$ (and as usual we implicitly
choose uniform $\mu$). The KL distance is always
non-negative, and equals zero iff its two arguments are
identical.

Define $N(U) = \int dy\, e^{-U(y)}$, the normalization constant
for the distribution proportional to $e^{-U(y)}$. (This
is called the partition function in statistical physics.)
Then using the KL distance, we arrive at the rationality
operator

$R_{KL}(U, p) = \mathrm{argmin}_{\beta}\, KL\left(p \,\Big\|\, \frac{e^{-\beta U}}{N(\beta U)}\right)$

$\qquad\qquad = \mathrm{argmin}_{\beta} \left[\beta \int dy\, p(y)\, U(y) + \ln(N(\beta U))\right]$.

In the appendix it is proven that $R_{KL}$ respects the two
requirements of rationality operators.
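The KL rationality operator can be sketched numerically for a finite set of moves. The objective $\beta E_p(U) + \ln N(\beta U)$ is convex in $\beta$, and its derivative is $E_p(U)$ minus the expected cost under the Boltzmann distribution at $\beta$, so the minimizer can be located by bisection on that derivative. The function names, the restriction to $\beta \ge 0$, the cap `beta_max`, and the use of bisection are all assumptions of this sketch rather than anything prescribed in the text.

```python
import math


def boltzmann(U, beta):
    """Boltzmann distribution p(y) proportional to exp(-beta * U(y)) on a finite set."""
    w = [math.exp(-beta * u) for u in U]
    z = sum(w)  # partition function N(beta * U)
    return [x / z for x in w]


def expected(U, p):
    """Expected cost of U under the distribution p."""
    return sum(u * pi for u, pi in zip(U, p))


def kl_rationality(U, p, beta_max=50.0, tol=1e-10):
    """argmin over 0 <= beta <= beta_max of beta * E_p(U) + ln N(beta * U)."""
    target = expected(U, p)
    lo, hi = 0.0, beta_max
    # Derivative at beta is target - E_beta(U); E_beta(U) decreases as beta grows.
    if expected(U, boltzmann(U, lo)) <= target:
        return 0.0  # p is no more peaked about the minimizers than the uniform one
    if expected(U, boltzmann(U, hi)) >= target:
        return hi   # p is more peaked than the cap allows us to resolve
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected(U, boltzmann(U, mid)) > target:
            lo = mid  # derivative still negative: the minimizer lies above mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


U = [0.0, 1.0, 2.0, 3.0]      # hypothetical cost values
p = boltzmann(U, 1.5)         # a Boltzmann distribution with beta = 1.5
print(kl_rationality(U, p))   # recovers a value close to 1.5 (requirement 1)
```

Applied to a Boltzmann distribution the sketch returns its own $\beta$, which is the first of the two requirements; applied to the uniform distribution it returns zero.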

The quantity $\ln(N(\beta U))$ appearing in the second equation,
when scaled by $-\beta^{-1}$, is called the free energy. It
is easy to verify that it equals the Lagrangian $E_p(U) -
S(p)/\beta$ if $p$ is given by the Boltzmann distribution $p(y) \propto e^{-\beta U(y)}$.
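The verification is a one-line computation, sketched here under the assumption of uniform $\mu$. Substituting the normalized Boltzmann distribution into the entropy shows that the Lagrangian $E_p(U) - S(p)/\beta$ matches $\ln N(\beta U)$ scaled by $-\beta^{-1}$:

```latex
\begin{align}
p(y) &= \frac{e^{-\beta U(y)}}{N(\beta U)}
\;\Longrightarrow\;
S(p) = -\int dy\, p(y) \ln p(y) = \beta\, E_p(U) + \ln N(\beta U), \\
E_p(U) - \frac{S(p)}{\beta} &= -\frac{1}{\beta} \ln N(\beta U).
\end{align}
```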

Say our prior knowledge is $\{\rho_i\}$, the rationalities of
the players for their associated effective cost functions.
Introduce the general notation

$[U]_{i,p}(x_i) = \int dx_{(i)}\, U(x_i, x_{(i)})\, p(x_{(i)} \mid x_i)$,

so that $[g_i]_{i,q}$ is player $i$'s effective cost function. Then
the Lagrangian for our prior knowledge is

$L(q) = \sum_i \lambda_i \left[R([g_i]_{i,q}, q_i) - \rho_i\right] - S(q)$, (6)

where the $\lambda_i$ are the Lagrange parameters. Just as before,
there is an alternative way to motivate this Lagrangian:
if our prior knowledge consists of the entropy of
the joint system, and we assume each player will have
maximal rationality subject to that prior knowledge, we
are led to the Lagrangian of Eq. 6.

It is shown in the appendix that for the Kullback-Leibler
rationality operator, we can replace any constraint
of the form $R([g_i]_{i,q}, q_i) = \rho_i$ with

$E_q(g_i) = \int dx\, g_i(x)\, \frac{e^{-\rho_i [g_i]_{i,q}(x_i)}}{N(\rho_i [g_i]_{i,q})}\, q_{(i)}(x_{(i)})$.

In other words, knowing
that player $i$ has KL rationality $\rho_i$ is equivalent to knowing
that the actual expected value of $g_i$ equals the "ideal
expected value", where $q_i$ is replaced by the Boltzmann
distribution of Eq. 3 with $\beta = \rho_i$. This contrasts with
the prior knowledge underlying the earlier expected-cost Lagrangian,
in which we know the actual numerical value of $E_q(g_i)$.

Just as before, we can focus on player $i$ by augmenting
our prior knowledge to include the strategies of all the
other players. The associated Lagrangian is

$L_i(q_i) = \lambda_i \left[R([g_i]_{i,q}, q_i) - \rho_i\right] - S(q_i)$. (7)

(The prior knowledge concerning the strategies of the
other players is manifested in the effective cost function.)
It is shown in the appendix that the set of all the Lagrangians
in Eq. 7 (one for each player) are minimized
simultaneously by any distribution of the form

$q_i^{g}(x_i) = \frac{e^{-\rho_i [g_i]_{i,q}(x_i)}}{N(\rho_i [g_i]_{i,q})}$.

In addition, since this distribution obeys all the constraints
in the Lagrangian in Eq. 6, we know that there
exists a minimizer of that Lagrangian. All of this holds
regardless of the precise rationality operator one uses.

Note that the Lagrangian $L_i$ of Eq. 7 for player $i$
arises in response to prior knowledge specific to player
$i$. Changing from one player and its Lagrangian to another
changes the prior knowledge. (The same is true for
the Lagrangians in Eq. 3.) In contrast, the Lagrangian of
Eq. 6 arises for a single unified body of prior knowledge,
namely the set of all players' rationalities.

For that single body of knowledge, the equilibrium of
the game is the solution to a single-objective optimization
problem. This contrasts with the conventional formulation
of full rationality game theory, where the equilibrium
is cast as a solution to a multi-objective optimization
problem (one objective per player). Furthermore, for finite
$\beta$, at least one of the derivatives of the Lagrangian
is negative infinite at the border of the allowed region of
product distributions (i.e., at the border of the Cartesian
product of unit simplices). Accordingly, all solutions lie
in the interior of that region. This can be a big advantage
for finding such solutions numerically, as elaborated
below.



K. Semi-coordinate systems 

Consider a multi-stage game like chess, with the stages
(i.e., the instants at which one of the players makes a
move) delineated by $t$. Now strategies are what are set
by the players before play starts. So in such a multi-stage
game the strategy of player $i$, $x_i$, must be the set of
$t$-indexed maps taking what that player has observed in
the stages $t' < t$ into its move at stage $t$. Formally, this
set of maps is called player $i$'s normal form strategy.

The joint strategy of the two players in chess sets their
joint move-sequence, though in general the reverse need
not be true. In addition, one can always find a joint
strategy to result in any particular joint move-sequence.
More generally, any onto mapping $\zeta : x \rightarrow z$, not necessarily
invertible, is called a semi-coordinate system.
The identity mapping $z \rightarrow z$ is a trivial example of a
semi-coordinate system. Another example is the mapping
from joint-strategies in a multi-stage game to joint
move-sequences. So changing the representation space of a
multi-stage game from move-sequences $z$ to strategies $x$ is a
semi-coordinate transformation of that game.

Typically there is overlap in what the players in chess
have observed at stages preceding the current one. This
means that even if the players' strategies are statistically
independent, their move sequences are statistically coupled.
In such a situation, by parameterizing the space of
joint-move-sequences $z$ with joint-strategies $x$, we shift
our focus from the coupled distribution $P(z)$ to the decoupled
product distribution, $q(x)$. This is the advantage
of casting multi-stage games in terms of normal form
strategies.

We can perform a semi-coordinate transformation even
in a single-stage game. Say we restrict attention to distributions
over spaces of possible $x$ that are product distributions.
Then changing $\zeta(.)$ from the identity map
to some other function means that the players are no
longer independent. After the transformation their strategy
choices (the components of $z$) are statistically
coupled, even though we are considering a product distribution.

Formally, this is expressed via the standard rule for
transforming probabilities,

$P_z(z) = \zeta(P_x) = \int dx\, P_x(x)\, \delta(z - \zeta(x))$, (8)

where $\zeta(.)$ is the mapping from $x$ to $z$, and $P_x$ and $P_z$ are
the distributions across $x$-space and $z$-space, respectively.
To see what this rule means geometrically, let $\mathcal{P}$ be the
space of all distributions (product or otherwise) over $z$'s.
Recall that $Q$ is the space of all product distributions
over $x$, and let $\zeta(Q)$ be the image of $Q$ in $\mathcal{P}$. Then by
changing $\zeta(.)$, we change that image; different choices of
$\zeta(.)$ will result in different manifolds $\zeta(Q)$.

As an example, say we have two players, with two possible
strategies each. So $z$ consists of the possible joint
strategies, labeled (1,1), (1,2), (2,1) and (2,2). Have the
space of possible $x$ equal the space of possible $z$, and
choose $\zeta(1,1) = (1,1)$, $\zeta(1,2) = (2,2)$, $\zeta(2,1) = (2,1)$,
and $\zeta(2,2) = (1,2)$. Say that $q$ is given by
$q_1(x_1 = 1) = q_2(x_2 = 1) = 2/3$. Then the distribution over
joint-strategies $z$ is $P_z(1,1) = 4/9$, $P_z(2,1) =
P_z(2,2) = 2/9$, $P_z(1,2) = 1/9$. So $P_z(z) \neq P_z(z_1) P_z(z_2)$;
the strategies of the players are statistically coupled.
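The numbers in this example can be reproduced with a few lines of code. The sketch below applies the discrete form of the transformation rule Eq. 8 to the product distribution and the map $\zeta$ given above, using exact rational arithmetic. The variable and function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Semi-coordinate map from the two-player example: joint x -> joint z.
zeta = {(1, 1): (1, 1), (1, 2): (2, 2), (2, 1): (2, 1), (2, 2): (1, 2)}

# Product distribution over x with q1(x1 = 1) = q2(x2 = 1) = 2/3.
q1 = {1: Fraction(2, 3), 2: Fraction(1, 3)}
q2 = {1: Fraction(2, 3), 2: Fraction(1, 3)}


def push_forward(q1, q2, zeta):
    """Discrete Eq. 8: P_z(z) is the sum of q1(x1) * q2(x2) over all x with zeta(x) = z."""
    pz = {}
    for x in product(q1, q2):
        z = zeta[x]
        pz[z] = pz.get(z, Fraction(0)) + q1[x[0]] * q2[x[1]]
    return pz


pz = push_forward(q1, q2, zeta)

# Marginals of P_z over each player's component of z.
m1 = {a: sum(p for z, p in pz.items() if z[0] == a) for a in (1, 2)}
m2 = {b: sum(p for z, p in pz.items() if z[1] == b) for b in (1, 2)}

# The players are coupled if P_z differs from the product of its marginals anywhere.
coupled = any(pz[z] != m1[z[0]] * m2[z[1]] for z in pz)
print(pz, coupled)
```

Here the marginals are $P_z(z_1 = 1) = 5/9$ and $P_z(z_2 = 1) = 2/3$, whose product $10/27$ differs from $P_z(1,1) = 4/9$.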

Such coupling of the players' strategies can be viewed
as a manifestation of sets of potential binding contracts.
To illustrate this, return to our two player example. Each
possible value of a component $x_i$ determines a pair of
possible joint strategies. For example, setting $x_1 = 1$
means the possible joint strategies are (1,1) and (2,2).
Accordingly such a value of $x_i$ can be viewed as a set
of proffered binding contracts. The value of the other
components of $x$ determines which contract is accepted;
it is the intersection of the proffered contracts offered
by all the components of $x$ that determines what single
contract is selected. Continuing with our example, given
that $x_1 = 1$, whether the joint-strategy is (1,1) or (2,2)
(the two options offered by $x_1$) is determined by the value
of $x_2$.

Binding contracts are a central component of coopera- 
tive game theory. In this sense, semi-coordinate transfor- 
mations can be viewed as a way to convert noncoopera- 
tive game theory into a form of cooperative game theory. 

While the distribution over $x$ uniquely sets the distribution
over $z$, the reverse is not true. However so long as
our Lagrangian directly concerns the distribution over $x$
rather than the distribution over $z$, by minimizing that
Lagrangian we set a distribution over $z$. In this way
we can minimize a Lagrangian involving product distributions,
even though the associated distribution in the
ultimate space of interest is not a product distribution.
The Lagrangian we choose over $x$ should depend on our
prior information, as usual. If we want that Lagrangian
to include an expected value over $z$'s (e.g., of a cost function),
we can directly incorporate that expectation value
into the Lagrangian over $x$'s, since expected values in $x$
and $z$ are identical: $\int dz\, P_z(z) A(z) = \int dx\, P_x(x) A(\zeta(x))$
for any function $A(z)$. (Indeed, this is the standard justification
of the rule for transforming probabilities, Eq. 8.)

However other functionals of probability distributions
can differ between the two spaces. This is especially common
when $\zeta(.)$ is not invertible, so the space of possible
$x$ is larger than the space of possible $z$. For example, in
general the entropy of a $q \in Q$ will differ from that of its
image, $\zeta(q) \in \zeta(Q)$ in such a case. (The prior probability
$\mu$ in the definition of entropy only gives us invariance
when the two spaces have the same cardinality.) A correction
factor is necessary to relate the two entropies.
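Both points can be checked on a toy case: expectations agree between the two spaces, while entropies do not once the map is many-to-one. The sketch below uses a hypothetical map that collapses four $x$-points onto two $z$-points, an arbitrary distribution over $x$, and an arbitrary function $A$; none of these values come from the text, and uniform $\mu$ is assumed in the entropy.

```python
import math

# Hypothetical non-invertible map: four x-points collapse onto two z-points.
zeta = {0: "a", 1: "a", 2: "b", 3: "b"}
p_x = {0: 0.1, 1: 0.3, 2: 0.2, 3: 0.4}  # arbitrary distribution over x
A = {"a": 1.0, "b": 5.0}                # arbitrary function on z-space

# Image distribution over z (discrete form of the transformation rule).
p_z = {}
for x, px in p_x.items():
    p_z[zeta[x]] = p_z.get(zeta[x], 0.0) + px


def entropy(p):
    """Shannon entropy with uniform mu, in nats."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)


exp_z = sum(p_z[z] * A[z] for z in p_z)        # expectation computed in z-space
exp_x = sum(p_x[x] * A[zeta[x]] for x in p_x)  # same expectation computed in x-space

print(exp_z, exp_x)                # equal up to rounding
print(entropy(p_x), entropy(p_z))  # these differ
```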

In such cases, we have to be careful about which space
we use to formulate our Lagrangian. If we use the transformation
$\zeta(.)$ as a tool to allow us to analyze bargaining
games with binding contracts, then the direct space of
interest is actually the $x$'s (that is the place in which the
players make their bargaining moves). In such cases it
makes sense to apply all the analysis of the preceding
sections exactly as it is written, concerning Lagrangians
and distributions over $x$ rather than $z$ (so long as we redefine
cost functions to implicitly pre-apply the mapping
$\zeta(.)$ to their arguments). However if we instead use $\zeta(.)$
simply as a way of establishing statistical dependencies
among the strategies of the players, it may make sense
to include the entropy correction factor in our $x$-space
Lagrangian.

An important special case is where the following three
conditions are met: Each point $z$ is the image under
$\zeta(.)$ of the same number of points in $x$-space, $n$; $\mu(x)$
is uniform (and therefore so is $\mu(z)$); and the Lagrangian
in $x$-space, $L_x$, is a sum of expected costs and the entropy.
In this situation, consider a $z$-space Lagrangian,
$L_z$, whose functional dependence on $P_z$, the distribution
over $z$'s, is identical to the dependence of $L_x$ on $P_x$, except
that the entropy term is divided by $n$ [s^l. Now
the minimizer $P^*(x)$ of $L_x$ is a Boltzmann distribution
in values of the cost function(s). Accordingly, for any
$z$, $P^*(x)$ is uniform across all $n$ points $x \in \zeta^{-1}(z)$ (all
such $x$ have the same cost value(s)). This in turn means
that $S(\zeta(P_x)) = nS(P_z)$. So our two Lagrangians give the
same solution, i.e., the "correction factor" for the entropy
term is just multiplication by $n$.



L. Entropic prior game theory 

Finally, it is worth noting that in the real world the
information we are provided concerning the system often
will not consist of exact values of functionals of $q$, be those
values expected costs, rationalities, or what have you.
Rather that knowledge will be in the form of data, $D$,
together with an associated likelihood function over the
space of $q$. For example, that knowledge might consist of
a bias toward particular rationality values, rather than
precisely specified values:

$P(D \mid q) \propto e^{-\alpha \sum_i [R_{KL}([g_i]_{i,q},\, q_i) - \rho_i]^2}$,

where $\alpha$ sets the strength of the bias.

The extension of the maximum entropy principle to 
such situations uses the entropic prior, P{q) cx e"''''^'-'-'. 
Bayes' theorem is then invoked to get the posterior dis- 
tribution [l^ : 

P{q I D) oc e"^'"'[^^^^([»'l''''^~''-l'"'''^(«'. 

The Bayes optimal estimate for q, under a quadratic 
penalty term, is then given by E(q \ D). The maxent 
principle for estimating q is given by this estimate under 
the limit of all going to infinity. For finite a solv- 
ing for E{q I D) can be quite complicated though. For 
simplicity, such cases are not considered here. 



III. PD THEORY AND STATISTICAL PHYSICS 

There are many connections between bounded ratio- 
nal game theory — PD theory — and statistical physics. 
This should not be too surprising, given that many of the 
important concepts in bounded rational game theory, like 
the Boltzmann distribution, the partition function, and 
free energy, were first explored in statistical physics. This 
section discusses some of these connections. 



A. Background on statistical physics 

Statistical physics is the physics of systems about 
which we have incomplete information. An example is 
knowing only the expected value of a system's energy 
(i.e., its temperature) rather than the precise value of the 
energy. The statistical physics of such systems is known 
as the canonical ensemble. Another example is the 
grand canonical ensemble (GCE). There the number 
of particles of various types in the system is also uncer- 
tain. As in the canonical ensemble, in the GCE what 
knowledge we do have takes the form of expectation val- 
ues of the quantities about which we are uncertain, i.e., 
the number of particles of the various types that the system
contains, and the energy of the system.

Traditionally these kinds of ensembles were analyzed 
in terms of "baths" of the uncertain variable that are 
connected to the system. For example, in the canonical 
ensemble the system is connected to a heat bath. In the 
GCE the system is also connected to a bath of particles 
of the various types. 

Such analysis showed that for the canonical ensemble
the probability of the system being in the particular
state $x$ is given by the Boltzmann distribution over the
associated value of the system's energy, $G(x)$, with $\beta$
interpreted as the (inverse) temperature of the system:
$p(x) \propto e^{-\beta G(x)}$. This result is independent of the detailed
characteristics of the physical system; all that is important
is the Hamiltonian $G(x)$, and temperature $\beta$.

Note that once one knows $p(x)$ and $G(x)$, one knows
the expected energy of the system. It is $G(x)$ that is a
fixed property of the system, whereas $\beta$ can vary. Accordingly,
specifying $\beta$ is exactly equivalent to specifying
the expected energy of the system.
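The equivalence rests on the expected energy under the Boltzmann distribution being a strictly decreasing function of $\beta$ whenever $G$ is not constant, so each $\beta$ corresponds to exactly one expected energy. A minimal sketch, with hypothetical energy levels chosen only for illustration:

```python
import math


def expected_energy(G, beta):
    """Expected energy under p(x) proportional to exp(-beta * G(x)) on a finite state set."""
    w = [math.exp(-beta * g) for g in G]
    z = sum(w)  # partition function
    return sum(g * wi for g, wi in zip(G, w)) / z


G = [0.0, 1.0, 2.0, 5.0]  # hypothetical energy levels
betas = (0.0, 0.5, 1.0, 2.0)
energies = [expected_energy(G, b) for b in betas]
print(energies)  # starts at the plain average of G and decreases with beta
```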

In the case of the GCE, $x$ implicitly specifies the number
of particles of the various types, as well as their
precise state. The analysis for that case showed that
$p(x) \propto e^{-\beta G(x) - \sum_i \mu_i n_i}$. In this formula $\beta$ is again the
inverse temperature, $n_i$ is the number of particles of type
$i$, and $\mu_i > 0$ is the chemical potential of each particle
of type $i$.

Jaynes was the first to show that these results of conventional
statistical physics could be derived without recourse
to artificial notions like "baths", simply by using
the maxent principle. In particular, he used the exact
reasoning in Sec. II F to derive the fact that the canonical
ensemble is governed by the Boltzmann distribution.



B. Mean field theory and PD theory 

In practice it can be quite difficult to evaluate this
Boltzmann distribution, due to difficulty in evaluating
the partition function. For example, in a spin glass,
$x$ is an $N$-dimensional vector of bits, one per particle,
and $G(x) = \sum_{i,j} H_{i,j} x_i x_j$. So the partition function
is given by $\int dx\, e^{-\beta \sum_{i,j} H_{i,j} x_i x_j}$, where $H$ is a symmetric
real-valued matrix, and as before we use $\int$ to indicate
the integral according to the appropriate measure (here
a point-sum measure). In general, evaluating this sum
for large numbers of spins cannot be done in closed form.

Mean Field (MF) theory is a technique for getting
around this problem by approximating the partition
function. Intuitively, it works by treating all the particles
as independent. It does this by replacing some of the
values of the state of a particle in the Hamiltonian by its
average state. For example, in the case of the spin glass,
one approximates $\sum_{i,j} H_{i,j}[x_i - E(x_i)][x_j - E(x_j)] \approx 0$,
where the expectation values are evaluated according to
the associated exact Boltzmann distribution, i.e., one assumes
that fluctuations about the means are relatively
negligible. This then means that

$G(x) \approx 2\sum_{i,j} H_{i,j}\, x_i\, E(x_j) - \sum_{i,j} H_{i,j}\, E(x_i) E(x_j)$.

The second sum in this approximation cancels out when
we evaluate the associated approximate Boltzmann distribution,
leaving us with the distribution

$p^{\beta}(x) \approx p^{\beta}_{MF}(x)$

where

$p^{\beta}_{MF}(x) = \prod_i \frac{e^{-\alpha_i x_i}}{\int dx_i\, e^{-\alpha_i x_i}}$,

$\alpha_i = 2\beta \sum_j H_{i,j} E(x_j)$.

This approximation $p^{\beta}_{MF}(x)$ is far easier to work
with than the exact Boltzmann distribution, $p^{\beta}(x) \propto e^{-\beta G(x)}$,
since each term in the product is for a single spin
by itself. In particular, if we adopt this approximation
we can use numerical techniques to solve the associated
set of simultaneous equations

$E(x_i) = -\frac{\partial}{\partial \alpha_i} \ln\left[\int dx_i\, e^{-\alpha_i x_i}\right]$

for the $E(x_i)$ (so that those $E(x_i)$ are no longer exactly
equal to the expected values of the $\{x_i\}$ under the distribution
$p^{\beta}(x)$). Given those $E(x_i)$ values, we can then
evaluate the associated approximate Boltzmann distribution
explicitly.

The mean field approximation to the Boltzmann distribution
is a product distribution, and in fact is identical
to the product distribution of bounded rational game
theory, for the team game where $g_i(x) = 2\beta G(x)\ \forall i$. Accordingly,
the "mean field theory" approximation for an
arbitrary Hamiltonian $U$ can be taken to be the associated
team game, which is defined for any $U$.

This bridge between bounded rational game theory and
statistical physics means that many of the powerful tools
that have been developed in statistical physics can be applied
to bounded rational game theory. It also means
that PD theoretic techniques can be applied in statistical
physics. In particular, it is shown elsewhere pol I2H
that if one replaces the identical cost function of each
player in a team game with different cost functions, then
the bounded rational equilibrium of that game can be
numerically found far more quickly. In the context of
statistical physics, this means that numerically solving
for a MF approximation may be expedited by assigning
a different Hamiltonian to each particle.



C. Information-theoretic misfit measures


The proper way to approximate a target distribution p 
with a distribution from a set C is to first specify a misfit 
measure saying how well each member of C approximates 
p, and then solve for the member with the smallest mis- 
fit. This is just as true when C is the set of all product 
distributions as when it is any other set. 
How best to measure distances between probability
distributions is a topic of ongoing controversy and research
|2^. The most common way to do so is with the
infinite limit log likelihood of data being generated by
one distribution but misattributed to have come from
the other. This is known as the Kullback-Leibler distance
[Sill 113:

$KL(p_1 \,\|\, p_2) = S(p_1 \,\|\, p_2) - S(p_1)$ (9)

where $S(p_1 \,\|\, p_2) = -\int dx\, p_1(x) \ln\left[\frac{p_2(x)}{\mu(x)}\right]$ is known as the
cross entropy from $p_1$ to $p_2$ (and as usual we implicitly
choose uniform $\mu$). The KL distance is always non-negative,
and equals zero iff its two arguments are identical.
However, it is far from being a metric. In addition
to violating the triangle inequality, it is not symmetric
under interchange of its arguments, and in numerical applications
has a tendency to blow up. (That happens
whenever the support of $p_1$ includes points outside the
support of $p_2$.)

Nonetheless, this is by far the most popular measure.
It is illuminating to use it as our misfit measure. As
shorthand, define the "pq distance" as $KL(p \,\|\, q)$, and
the "qp distance" as $KL(q \,\|\, p)$, where $p$ is our target
distribution and $q$ is a product distribution. Then it is
straightforward to show that the qp distance from $q$ to
target distribution $p^{\beta}$ is just the maxent Lagrangian,
up to irrelevant overall constants. In other words, the
$q$ minimizing the maxent Lagrangian (the distribution
arising in MF theory) is the $q$ with the minimal qp
distance to the associated Boltzmann distribution.
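The claim that the qp distance equals the maxent Lagrangian up to a constant can be checked by brute force on a tiny system. The sketch below assumes a three-bit spin glass with a hypothetical coupling matrix `H` and inverse temperature `beta`, enumerates the exact Boltzmann distribution, and verifies that $KL(q \,\|\, p^{\beta}) - [\beta E_q(G) - S(q)]$ equals the log of the partition function for whichever product distribution $q$ is supplied. All names and numbers are illustrative assumptions.

```python
import math
from itertools import product

beta = 0.7
H = [[0.0, 1.0, -0.5],
     [1.0, 0.0, 0.3],
     [-0.5, 0.3, 0.0]]  # hypothetical symmetric couplings
n = len(H)
states = list(product((0, 1), repeat=n))


def G(x):
    """Spin-glass Hamiltonian G(x) = sum_{i,j} H[i][j] * x_i * x_j."""
    return sum(H[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


# Exact Boltzmann distribution by enumeration (only feasible for tiny n).
weights = [math.exp(-beta * G(x)) for x in states]
Z = sum(weights)
p = [w / Z for w in weights]


def product_probs(m):
    """q(x) over all states for independent bits with means m[i] = q_i(x_i = 1)."""
    return [math.prod(m[i] if x[i] else 1.0 - m[i] for i in range(n)) for x in states]


def lagrangian(m):
    """Maxent Lagrangian beta * E_q(G) - S(q) for the product distribution q."""
    q = product_probs(m)
    energy = sum(qi * G(x) for qi, x in zip(q, states))
    entropy = -sum(qi * math.log(qi) for qi in q if qi > 0)
    return beta * energy - entropy


def kl_qp(m):
    """qp distance KL(q || p) from the product distribution to the Boltzmann target."""
    q = product_probs(m)
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


m = [0.2, 0.6, 0.9]             # an arbitrary product distribution
gap = kl_qp(m) - lagrangian(m)  # should equal ln Z regardless of m
print(gap, math.log(Z))
```

Since the gap does not depend on $q$, minimizing the Lagrangian over product distributions and minimizing the qp distance select the same $q$.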

However the qp distance is the (infinite limit of the 
negative log of) the likelihood that distribution p would 
attribute to data generated by distribution q. It can be 
argued that a better measure of how well q approximates 
p would be based on the likelihood that q attributes to 
data generated by p. This is the pq distance. Up to an 
overall additive constant (of the canonical distribution's 
entropy) , the pq distance is 

$KL(p \,\|\, q) = -\sum_i \int dx\, p(x) \ln[q_i(x_i)]$.

This is equivalent to a team game where each coordinate
$i$ has the "Lagrangian"

$L^{*}_{i}(q) = -\int dx_i\, p_i(x_i) \ln[q_i(x_i)]$,

where $p_i(x_i)$ is the marginal distribution $\int dx_{(i)}\, p(x)$.

The minimizer of this is just $q_i = p_i\ \forall i$, i.e., each $q_i$
is set to the associated marginal distribution of $p$. So in
particular, when our target distribution is the canonical
ensemble distribution $p^{\beta}$, the optimal $q$ according to pq
distance is the set of marginals of $p^{\beta}$. Note that unlike
the solution for qp distance, here the solution for each
$q_i$ is independent of the $q_{(i)}$. So we don't have a game
theory scenario; we do not need to pay attention to the
$q_{(i)}$ when estimating each separate $q_i$. Correspondingly,
whereas there are many local minima of the team game
Lagrangian studied above, $q \in Q \rightarrow KL(q \,\|\, p^{\beta})$, there
is only one, global minimum of $q \in Q \rightarrow KL(p^{\beta} \,\|\, q)$.
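A small numerical check of this result: for a coupled target over two bits, the product of the target's marginals attains the smallest pq distance among a grid of candidate product distributions. The target distribution, the grid resolution, and the function names below are hypothetical choices made only for illustration.

```python
import math
from itertools import product

# Hypothetical coupled target distribution over two bits.
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}


def marginal(p, i):
    """Marginal distribution of component i of the joint distribution p."""
    m = {}
    for x, px in p.items():
        m[x[i]] = m.get(x[i], 0.0) + px
    return m


def kl_pq(p, q1, q2):
    """pq distance KL(p || q) for the product distribution q(x) = q1(x1) * q2(x2)."""
    return sum(px * math.log(px / (q1[x[0]] * q2[x[1]]))
               for x, px in p.items() if px > 0)


p1, p2 = marginal(p, 0), marginal(p, 1)
best = kl_pq(p, p1, p2)  # pq distance of the product of marginals

# Compare against a grid of other product distributions.
grid = [k / 20 for k in range(1, 20)]
others = [kl_pq(p, {0: a, 1: 1 - a}, {0: b, 1: 1 - b})
          for a, b in product(grid, grid)]
print(best, min(others))
```

The remaining misfit `best` is strictly positive here because the target couples its two components, so no product distribution can match it exactly.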

Another difference between the two kinds of KL distance
is how the associated optimal product distributions
are typically calculated numerically. The product distribution
that optimizes the maxent Lagrangian is usually
found via derivative-based traversal of that Lagrangian,
or techniques like (mixed) Brouwer updating |23, |2ll |23,
|23,I36|. In contrast, the integral giving each marginal
distribution of $p$ is usually found via adaptive importance
sampling of the associated integral, with the proposal
distribution for the integral to approximate $p_i$ set adaptively,
as $q_{(i)}$ |23.

It is possible to motivate yet other choices for the $q$
that best approximates $p^{\beta}$. To derive one of them, start
with Lemma 1, with $\mathbb{R}^n$ set to the space of real-valued
functions over the set of $x$'s (so that $n$ is the number of
possible $x$). Have a single constraint $f$ that restricts us to
the unit simplex in $\mathbb{R}^n$, i.e., that restricts us to the set
of functions that (assuming they are nowhere-negative)
are probability distributions. Choose $V$ to be the associated
Lagrangian, $L(p) = \beta E_p(G) - S(p)$, $p$ being a point
in our constrained submanifold of $\mathbb{R}^n$. Note that this $p$
can be any distribution over the $x$'s, including one that
couples the components $\{x_i\}$.

Say we are at some current product distribution $q$.
Then we can apply Lemma 1 with the choices just outlined
to tell us what direction to move from $q$ in so
as to reduce the Lagrangian. In general, taking a step
in that direction will result in a distribution $p'$ that is
not a product distribution. However we can solve for the
product distribution that is closest to that $p'$, and move
to that product distribution. By iterating this procedure
we can define a search over the submanifold of product
distributions. We can then solve for the product distribution
at which this search will terminate.

To do this, of course, we must define what we mean by
"closest". Say that we choose to measure closeness by pq
distance. Then the terminating product distribution
is the one for which the marginals of $\nabla L + \lambda \nabla f$ all equal
0. For each $i$, this means that

$\int dx_{(i)} \left[\beta G(x) + \ln(p(x)) + 1 + \lambda\right] = 0$

at the equilibrium product distribution $p$. Writing out
$p = \prod_i q_i$ and evaluating gives

$q_i(x_i) \propto \exp\left(-\beta\, \frac{\int dx_{(i)}\, G(x)}{\int dx_{(i)}\, 1}\right)$. (10)

This is akin to the equilibrium $q_i$ of a bounded rational game, except
that each player/particle $i$ sets its distribution by evaluating
the conditional expectation of the Hamiltonian with a uniform
distribution over the $x_{(i)}$ rather than with $q_{(i)}$.
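The step from the marginal condition to Eq. 10 can be spelled out as follows (a sketch of the intermediate algebra, which the text leaves implicit). Substituting $p(x) = \prod_j q_j(x_j)$ splits the logarithm into a term depending on $x_i$ and terms that integrate to constants:

```latex
\begin{align}
\int dx_{(i)} \ln p(x)
  &= \ln q_i(x_i) \int dx_{(i)}\, 1
   \;+\; \sum_{j \neq i} \int dx_{(i)} \ln q_j(x_j), \\
0 &= \beta \int dx_{(i)}\, G(x)
   + \ln q_i(x_i) \int dx_{(i)}\, 1 + \text{const}, \\
\ln q_i(x_i)
  &= -\beta\, \frac{\int dx_{(i)}\, G(x)}{\int dx_{(i)}\, 1} + \text{const},
\end{align}
```

where "const" collects every term that does not depend on $x_i$, including the contributions of $1 + \lambda$. Exponentiating and normalizing gives Eq. 10.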
D. Semi-coordinate transformations

Say there are numerical difficulties with our finding
a $q$ that is a local minimizer of the maxent Lagrangian.
Even if we find one, that $q$ might still be a poor fit to $p(x)$ if it
is far from the global minimizer of the Lagrangian. Furthermore,
even the global minimizer might be a poor fit,
if $p(x)$ simply can't be well-approximated by a product
distribution.

There are many techniques for improving the fit of a
product distribution to a target distribution in machine
learning and statistics [27]. To give a simple example,
say one wishes to approximate the target distribution in
$\mathbb{R}^n$ with a product of Gaussians, one Gaussian for each
coordinate. Even if the target distribution is a Gaussian, if
it is askew, then one won't be able to do a good job of
approximating it with a product of Gaussians. However
one can use Principal Components Analysis (PCA) to
find how to rotate one's coordinates so that a product of
Gaussians fits the target exactly.
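In two dimensions the rotation can be written in closed form, which makes the idea easy to check. The sketch below does not run PCA on data; it assumes the covariance of the target Gaussian is already known and computes the rotation angle that removes the off-diagonal term, which is the rotation PCA would recover. The covariance entries and function names are hypothetical.

```python
import math


def diagonalizing_angle(a, b, c):
    """Rotation angle that diagonalizes the 2x2 covariance [[a, c], [c, b]]."""
    return 0.5 * math.atan2(2.0 * c, a - b)


def rotate_covariance(a, b, c, theta):
    """Return the entries of R^T * Sigma * R for the rotation R by angle theta."""
    cs, sn = math.cos(theta), math.sin(theta)
    new_a = a * cs * cs + 2.0 * c * cs * sn + b * sn * sn
    new_b = a * sn * sn - 2.0 * c * cs * sn + b * cs * cs
    new_c = (b - a) * cs * sn + c * (cs * cs - sn * sn)
    return new_a, new_b, new_c


a, b, c = 2.0, 1.0, 0.8  # hypothetical skewed covariance (c != 0 couples the coordinates)
theta = diagonalizing_angle(a, b, c)
var1, var2, cov12 = rotate_covariance(a, b, c, theta)
print(var1, var2, cov12)  # cov12 vanishes up to rounding
```

In the rotated coordinates the covariance is diagonal, so the Gaussian factors into a product of two one-dimensional Gaussians with variances `var1` and `var2`.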

Similar techniques can address both the issue of breaking
free of local minima of the Lagrangian, and improving
the accuracy of the best product distribution approximation
to $p$. More precisely, identify $x$ with the variables $z$
discussed in Sec. II K. Then consider changing the map
$\zeta(.) : x \rightarrow z$ from the identity map. This will in general
change the mapping from $P_x$ to $L_z(\zeta(P_x))$. So if $L_z$ is
the Lagrangian we are interested in, the mapping from
product distributions over $x$ can be changed by changing
$\zeta(.)$, in general.

As an example, consider the case where the space of $x$'s
is identical to the space of $z$'s, and consider all possible
bijective transformations $\zeta(.)$. Entropy is the same in
both spaces for any $\zeta$, i.e., $S(P_z) = S(\zeta(P_x)) = S(P_x)$.
So for fixed $P_x$, the entropy in $z$-space is independent of
$\zeta(.)$. However if we fix $P_x$ and change $\zeta(.)$ the expected
values of utilities will change. So $L_z(\zeta(P_x))$ does depend
on $\zeta(.)$, as claimed.
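This can be seen concretely on a two-player, two-strategy example. The sketch below fixes a product distribution over $x$, applies two different bijections (the identity and a map that swaps two joint strategies), and compares entropy and expected cost in $z$-space. The cost values are hypothetical and chosen only so that the difference is visible.

```python
import math

# Fixed product distribution over x with q1(1) = q2(1) = 2/3.
p_x = {(1, 1): 4 / 9, (1, 2): 2 / 9, (2, 1): 2 / 9, (2, 2): 1 / 9}
cost = {(1, 1): 0.0, (1, 2): 3.0, (2, 1): 1.0, (2, 2): 2.0}  # hypothetical cost on z

identity = {x: x for x in p_x}
swap = {(1, 1): (1, 1), (1, 2): (2, 2), (2, 1): (2, 1), (2, 2): (1, 2)}


def image(p_x, zeta):
    """Distribution over z induced by a bijection zeta (one x per z, so no summing)."""
    return {zeta[x]: px for x, px in p_x.items()}


def entropy(p):
    return -sum(v * math.log(v) for v in p.values() if v > 0)


def expected_cost(p_z):
    return sum(pz * cost[z] for z, pz in p_z.items())


pz_id, pz_swap = image(p_x, identity), image(p_x, swap)
print(entropy(pz_id), entropy(pz_swap))              # identical
print(expected_cost(pz_id), expected_cost(pz_swap))  # different
```

With these numbers the expected cost is $10/9$ under the identity map and $1$ under the swap, while the entropy is unchanged, so the value of the Lagrangian at the same $P_x$ differs between the two maps.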

This means that by changing $\zeta(.)$ while leaving $q_x$ unchanged,
we will in general change whether we are at a
local minimum of $L_z(\zeta(q_x))$. Furthermore, such a change
will change how closely the global minimizer of $L_z(\zeta(q_x))$
approximates any particular target distribution. Indeed,
some such transformation will always transform a team
game to have a strictly convex maxent Lagrangian, with
only one (bounded rational) equilibrium, an equilibrium
that is in the interior of the region of allowed $q$ and
that has the lowest possible value of the Lagrangian.
In the worst case, we can get this behavior by transforming
to the semi-coordinate system in which $x$ is one-dimensional,
so that any $p(z)$, whether or not it couples its variables,
can be expressed as a $q(x) = q_1(x_1)$.

Note that unlike with PCA, semi-coordinate transformations
can be used for non-Euclidean semi-coordinates
(i.e., when neither $x$'s nor $z$'s are Euclidean vectors).
They also can be guided by numerous measures of the
goodness of fit to the target distribution (e.g., KL distance),
in contrast to PCA's restriction to assuming a
Gaussian likelihood.



E. Bounded rational game theory for variable 
number of players 

The bridge between statistical physics and bounded rational
game theory has many uses beyond the practical
ones alluded to in the previous subsection. In particular,
it suggests extending bounded rational game theory to
ensembles other than the canonical ensemble. As an example,
in the GCE the number of particles of the various
allowed types is uncertain and can vary. The bounded
rational game theory version of that ensemble is a game
in which the number of players of various types can vary.

We can illustrate this by extending a simple instance 
of evolutionary game theory |0| to incorporate bounded 
rationality and allow for a finite total number of play- 
ers. Say we have a finite population of players, each of 
which has one of m' possible types. (These are some- 
times called feature vectors in the literature.) Each 
player i in the population is randomly paired with a dif- 
ferent player j, and they each choose a strategy for a two- 
person game. The set of strategies each of those players 
can choose among is fixed by its respective attribute vec- 
tor. In addition the cost player i receives depends on the 
attribute vectors of itself and of j, in addition to their 
joint strategy. Finally, to reflect this dependence, we al- 
low each player to vary its strategy depending on the 
attribute vector of its opponent; we call player i's
meta-strategy the mapping from its opponent's attribute
vector to i's strategy [53].

We encode an instance of this scenario in an x with
a countably infinite number of dimensions. $x_{i,0} = n_i(x)$
specifies the number of players of type i, with $\vec{n}(x)$
being the vector of the number of players of all types. For
$1 \le j \le x_{i,0}$, $x_{i,j} = s_{i,j}(x)$ is the meta-strategy selected
by the j'th player of type i. If its opponent is the j'th
player of type T', the cost to the i'th player of type T
is $g_{T,i,T',j}(x) = g_{T,i,T',j}(s, s', n_T, n_{T'})$, where s and s'
are the two players' respective meta-strategies. To enforce
consistency between the index numbers i, j and the associated
numbers of players, we set $g_{T,i,T',j}(s, s', \vec{n}) = 0$ if
either $i > n_T$ or $j > n_{T'}$.

To start we parallel the GCE, and presume that for
each type we know the expected number of players having
that type, and the expected cost averaged over all
players having that type. Also stipulate that the distribution
over x is a product distribution, q. Then our prior
information specifies the values of

and 



respectively, for all types T. (The sums over j and k all
implicitly extend from 1 to $\infty$, and the delta functions
are Kronecker deltas that prevent a player from playing
itself.)

We can write these expressions as expectation values,
over x, of 2m' functions. These functions are the m'
functions $n_T(x) = x_{T,0}$ (one function for each T) and
the m' functions

cAx) ^ ^"'^•■^^'^"'^-''--^"---'--^"^^9(x,,) 

respectively, where $\Theta$ is the Heaviside theta function that
equals 1 if its argument exceeds 0, and equals 0 otherwise.
Accordingly, the maxent principle directs us to minimize
the Lagrangian

$$L(q) = \sum_T \left[ \alpha_T (E(n_T) - N_T) + \beta_T (E(c_T) - C_T) \right] - S(q)$$



where the integers $\{N_T\}$ and real numbers $\{C_T\}$ are our
prior information. In the usual way, the solution for each
pair ($i \in \{1, \ldots, m'\}$, $j > 0$) is

where the values of the Lagrange parameters are all set 
by our prior information. 
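
How a Lagrange parameter is "set by prior information" can be sketched on a deliberately reduced problem. The example below assumes a single variable n (a player count truncated to 0..20) and a single constraint on its expectation; the truncation, the target value 3.0, and the function names are all illustrative assumptions, not the full multi-type problem of this section.

```python
import math

def boltzmann(alpha, support):
    """Maxent form q(n) proportional to exp(-alpha * n) on a finite support."""
    logits = [-alpha * n for n in support]
    top = max(logits)                      # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def mean(q, support):
    return sum(p * n for p, n in zip(q, support))

def solve_alpha(target, support, lo=-50.0, hi=50.0, iters=200):
    """Bisection on alpha; the mean of q decreases monotonically as alpha grows."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(boltzmann(mid, support), support) > target:
            lo = mid                       # mean too large, so alpha must increase
        else:
            hi = mid
    return 0.5 * (lo + hi)

support = list(range(21))                  # hypothetical truncation of the player count
N_target = 3.0                             # hypothetical prior information E(n) = 3
alpha = solve_alpha(N_target, support)
q = boltzmann(alpha, support)
```

The same pattern, one multiplier per constraint tuned until the constraint holds, extends to several constraints, although a multi-dimensional root finder would then replace the bisection.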

This distribution is analogous to the one in the GCE. 
As usual, one can consider variants of it by focusing on 
one variable at a time, having prior knowledge in the 
form of rationality values, etc. In addition, even if we 
stay in this random-2-player games scenario, there is no 
reason for us to restrict attention to prior information 
paralleling that of the GCE. As with bounded rational 
game theory with a fixed number of players, our prior 
information can concern nonlinear functions of q, couple
the cost functions, etc. 

In particular, in evolutionary game theory we do not 
know the expected number of players having each type, 
nor their average costs. In addition, the equilibrium con- 
cept stipulates that all players will have type T if a par- 
ticular condition holds. That condition is that the addi- 
tion of a player of type other than T to the population 
results in an expected cost to that added player that is 
greater than the associated expected cost to the players 
having type T. This provides a model of the phenotypic
interactions underlying natural selection. 



We can encapsulate evolutionary game theory in a La- 
grangian by appropriately replacing each pair of GCE- 
type constraints (one pair for each type) with a single 
constraint. As an example, we could have the (single) 
constraint for type T be that 



E{ 



= m 



max^, (c^, ) - min^, (c^, ) 



') (11) 



for some positive real value $\gamma$. For finite $\gamma$, the entropy
term in the Lagrangian ensures that for no T is the expectation
value on the left-hand side of this constraint exactly
0.

In the limit of infinite $\gamma$, the distribution minimizing
this Lagrangian is non-infinitesimal only for the
evolutionarily stable strategies of conventional evolutionary
game theory. These are the (type, strategy) pairs
that are best performing, in the sense that no other pair
has a lower cost function value. The distribution for finite
$\gamma$ can be viewed as a "bounded rational" extension
of conventional evolutionary game theory. In that extension
(type, strategy) pairs are allowed even if they don't
have the lowest possible cost, so long as their cost is close
to the lowest possible [s^ . 
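
The qualitative contrast between finite and very large $\gamma$ can be sketched as follows. This is only an illustration under an assumed Boltzmann-like weighting of range-normalized costs; it is not the solution of the Lagrangian above, and the cost values and the function name are hypothetical.

```python
import math

def bounded_rational_weights(costs, gamma):
    """Assumed weighting: exp(-gamma * normalized cost gap), then normalize."""
    lowest, highest = min(costs), max(costs)
    weights = [math.exp(-gamma * (c - lowest) / (highest - lowest)) for c in costs]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical costs of three (type, strategy) pairs.
costs = [1.0, 1.2, 3.0]

soft = bounded_rational_weights(costs, gamma=1.0)     # finite gamma
sharp = bounded_rational_weights(costs, gamma=200.0)  # approaching the full-rationality limit
```

With `gamma=1.0` the near-optimal second pair keeps substantial probability, whereas with `gamma=200.0` essentially all the probability sits on the lowest-cost pair, while every pair still has strictly positive probability.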

There is always a solution to this Lagrangian (unlike
the case in conventional full rationality evolutionary
game theory). The technique of Lagrange parameters
provides that solution for each pair ($i \in \{1, \ldots, m'\}$, $j > 0$)
in the usual way:

where the Lagrange parameters enforce our constraint, 
and 



n , 

T' 



En ,, max „ (c ,, ) — min „ (c „ ) 



More general forms of evolutionary game theory al- 
low games with more than two players, and localization 
via network structures delineating how players are likely 
to be grouped to play a game. Other elaborations have 
each player not know the exact attribute vectors of all its 
opponents, but only an "information structure" provid- 
ing some information about those opponents' attribute 
vectors. All such extensions can be straightforwardly in- 
corporated into the current analysis. Many other exten- 
sions are simple to make as well. For example, since the 
cost functions have all components of n in their argument 
lists, they can depend on the total size of the population. 
This allows us to model the effect on population size of 
finite environmental resources. 

Note that if we change how we encode the number of 
players of the various types and their joint meta-strategy 
in X, we change the form of the expectations in Eq. ^2 
This reflects the fact that by changing the encoding we 
change the implication of using a product distribution. 
Formally, such a change in the encoding is a change in
the semi-coordinate system. See Sec. III D.
IV. APPENDIX

This appendix provides proofs absent from the main 
text. 



A. $\sum_i [E_p(g_i)]^2 - S(p)$ is convex over the unit
simplex

Proof: Since $S(p)$ is concave over the unit simplex,
and the unit simplex lies in a hyperplane, it suffices to prove
that $\sum_i [E_p(g_i)]^2$ is convex over all of Euclidean space.
Since a weighted average of convex functions is convex,
we only need to prove that any single function of the form
$[\int dx\, p(x) f(x)]^2$ is convex. The Hessian of this function
is $2 f(x) f(x')$. Rotate coordinates so that f is a basis
vector, i.e., so that f is proportional to a delta function.
This doesn't change the eigenvalues of the Hessian. After
this change though, the Hessian is diagonal, with one
non-zero entry on the diagonal, which is non-negative.
So its eigenvalues are zero and a non-negative number.
QED.
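
The rank-one structure used in this proof is easy to check numerically. The sketch below assumes a discrete x with five values and a randomly drawn f (both hypothetical), and evaluates the quadratic form of the Hessian $2 f(x) f(x')$ on random directions.

```python
import random

def quad_form(f, v):
    """v^T H v for the Hessian H[x][x'] = 2 f(x) f(x')."""
    n = len(f)
    return sum(v[i] * 2.0 * f[i] * f[j] * v[j] for i in range(n) for j in range(n))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

random.seed(0)                                         # fixed seed for reproducibility
f = [random.uniform(-1, 1) for _ in range(5)]          # hypothetical function f(x)
directions = [[random.uniform(-1, 1) for _ in range(5)] for _ in range(1000)]

smallest = min(quad_form(f, v) for v in directions)
# The quadratic form collapses to 2 (f . v)^2, which is why it cannot be negative.
max_gap = max(abs(quad_form(f, v) - 2.0 * dot(f, v) ** 2) for v in directions)
```

Up to rounding, `smallest` is non-negative and `max_gap` is zero, consistent with a Hessian whose only non-zero eigenvalue is $2 |f|^2$.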



B. $R_{KL}$ is a rationality operator

Proof: Since KL distance only equals 0 when its arguments
match and is never negative, requirement (1)
of rationality operators holds for $R_{KL}$. Next, since
$R_{KL}(U, p) = \mathrm{argmin}_\beta [\beta \int dy\, p(y) U(y) + \ln(N(\beta U))]$, we know
that

$$E_p(U) = -\left.\frac{\partial \ln(N(\beta U))}{\partial \beta}\right|_{\beta = R_{KL}(U,p)}.$$

Accordingly,
all p with the same rationality have the same expected
value $E_p(U)$. Using the technique of Lagrange parameters
then readily establishes that of those distributions
having the same expected U, the one with maximal entropy
is a Boltzmann distribution. Furthermore, by requirement
(1), we know that for a Boltzmann distribution
the exponent $\beta$ must equal the rationality of that
distribution. QED.
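
A minimal numerical sketch of this operator follows. It assumes a finite set of four outcomes, the convention $N(\beta U) = \sum_y e^{-\beta U(y)}$, and a bisection search for the stationary point of $\beta E_p(U) + \ln N(\beta U)$; the function names and the values of U are hypothetical.

```python
import math

def boltzmann(beta, U):
    """Distribution proportional to exp(-beta * U(y)) over a finite set."""
    logits = [-beta * u for u in U]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def expect(p, U):
    return sum(pi * ui for pi, ui in zip(p, U))

def r_kl(U, p, lo=-50.0, hi=50.0, iters=200):
    """beta at which the Boltzmann expectation of U matches E_p(U).

    This is the stationarity condition of beta * E_p(U) + ln N(beta U),
    which is convex in beta, so the stationary point is the argmin.
    """
    target = expect(p, U)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if expect(boltzmann(mid, U), U) > target:
            lo = mid                       # Boltzmann mean too high, so beta must increase
        else:
            hi = mid
    return 0.5 * (lo + hi)

U = [0.0, 1.0, 2.0, 4.0]                   # hypothetical cost values
beta_0 = 1.5
rationality_of_boltzmann = r_kl(U, boltzmann(beta_0, U))
rationality_of_uniform = r_kl(U, [0.25] * 4)
```

The Boltzmann distribution with exponent 1.5 is assigned rationality 1.5, as requirement (1) demands, and the uniform distribution is assigned rationality 0.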



C. Alternative form of a constraint on $R_{KL}$

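Proof: Let $f(a, v)$ be any function that is monotonically
decreasing in its (real-valued) first argument. Then
any constraint $R([g_i]_{i,q}, q_i) - \rho_i = 0$ is satisfied iff the
constraint $f(R([g_i]_{i,q}, q_i), q_{(i)}) - f(\rho_i, q_{(i)}) = 0$ is satisfied.
Choose

$$f(a, q_{(i)}) = -\left.\frac{\partial \ln(N(\beta [g_i]_{i,q}))}{\partial \beta}\right|_{\beta = a} = \frac{\int dx_i\, [g_i]_{i,q}(x_i)\, e^{-a [g_i]_{i,q}(x_i)}}{N(a [g_i]_{i,q})}.$$

Differentiating this quantity with respect to a gives the
negative of the variance of $[g_i]_{i,q}$ under the Boltzmann
distribution. Since variances are non-negative,
this derivative is non-positive, which establishes that f
is monotonically decreasing in its first argument.
Evaluating,

$$f(\rho_i, q_{(i)}) = \int dx\, g_i(x)\, q_{(i)}(x_{(i)})\, \frac{e^{-\rho_i E(g_i \mid x_i)}}{N(\rho_i g_i)}.$$

In addition, from the equation defining $R_{KL}$, we know
that

$$-\left.\frac{\partial \ln(N(\beta U))}{\partial \beta}\right|_{\beta = R_{KL}(U, q_i)} = \int dx_i\, q_i(x_i) U(x_i)$$

for any function U. Plugging in $U = [g_i]_{i,q}$, we see that

$$f(R([g_i]_{i,q}, q_i), q_{(i)}) = \int dx_i\, q_i(x_i) [g_i]_{i,q}(x_i) = E_q(g_i).$$

QED.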

D. $q^g$ minimizes the Lagrangians of Eq. I?l

Proof: Following Nash, we can use Brouwer's fixed
point theorem to establish that for any non-negative $\{\rho_i\}$,
there must exist at least one product distribution given
by $q^g$. The constraint term in all the $L_i$ of Eq. d is
zero for this distribution. By requirement (2), we also
know that given $q^g_{(i)}$ (and therefore $[g_i]_{i,q^g}$), there is no
$q_i$ with rationality $\rho_i$ that has higher entropy than $q^g_i$.
Accordingly, no such $q_i$ will have a lower value of $L_i$. Since
this holds for all i, $q^g$ minimizes all the Lagrangians in
Eq. 13 simultaneously. QED.
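
Brouwer's theorem only guarantees that such a fixed point exists; it does not say how to find it. As a sketch under stated assumptions, the code below looks for a coupled Boltzmann fixed point in a hypothetical two-player, two-move game by simple alternating updates. The cost tables `g0` and `g1`, the rationalities of 0.5, and the function names are all made up, and plain iteration is only expected to converge when the rationalities are small enough for the update to be a contraction.

```python
import math

def boltzmann(rho, costs):
    """Distribution proportional to exp(-rho * cost) over a player's moves."""
    logits = [-rho * c for c in costs]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def conditional_costs(g, q_other, player):
    """E(g_i | x_i), where g[a][b] is the cost when player 0 plays a and player 1 plays b."""
    if player == 0:
        return [sum(g[a][b] * q_other[b] for b in range(len(q_other)))
                for a in range(len(g))]
    return [sum(g[a][b] * q_other[a] for a in range(len(q_other)))
            for b in range(len(g[0]))]

def fixed_point(g0, g1, rho0, rho1, iters=500):
    q0, q1 = [0.5, 0.5], [0.5, 0.5]
    for _ in range(iters):
        q0 = boltzmann(rho0, conditional_costs(g0, q1, 0))
        q1 = boltzmann(rho1, conditional_costs(g1, q0, 1))
    return q0, q1

def residual(g0, g1, rho0, rho1, q0, q1):
    """Largest violation of the two coupled Boltzmann conditions."""
    r0 = boltzmann(rho0, conditional_costs(g0, q1, 0))
    r1 = boltzmann(rho1, conditional_costs(g1, q0, 1))
    return max(abs(a - b) for a, b in zip(q0 + q1, r0 + r1))

g0 = [[1.0, 0.0], [0.0, 2.0]]   # hypothetical costs for player 0
g1 = [[0.0, 1.0], [2.0, 0.0]]   # hypothetical costs for player 1
q0, q1 = fixed_point(g0, g1, 0.5, 0.5)
gap = residual(g0, g1, 0.5, 0.5, q0, q1)
```

A vanishing `gap` indicates that each player's distribution is the Boltzmann distribution over its conditional expected costs given the other player's distribution, which is the property the proof relies on.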

E. Derivation of Lemma 1 

Proof: Consider the set of $\vec{u}$ such that the directional
derivatives $D_{\vec{u}} f_i$ evaluated at x' all equal 0. These are
the directions consistent with our constraints to first order.
We need to find the one of those $\vec{u}$ such that $D_{\vec{u}} V$
evaluated at x' is maximal.

To simplify the analysis we introduce the constraint
that $|\vec{u}| = 1$. This means that the directional derivative
$D_{\vec{u}} V$ for any function V is just $\vec{u} \cdot \nabla V$. We then use
Lagrange parameters to solve our problem. Our constraints
on $\vec{u}$ are $\vec{u}^2 = 1$ and $D_{\vec{u}} f_i(x') = \vec{u} \cdot \nabla f_i(x') = 0\ \forall i$.
Our objective function is $D_{\vec{u}} V(x') = \vec{u} \cdot \nabla V(x')$.

Differentiating the Lagrangian gives

$$\nabla V(x') - 2\lambda_0 \vec{u} - \sum_i \lambda_i \nabla f_i(x') = 0,$$

with solution

$$\vec{u} = \frac{\nabla V(x') - \sum_i \lambda_i \nabla f_i(x')}{2\lambda_0}.$$

$\lambda_0$ enforces our constraint on $|\vec{u}|$. Since we are only
interested in specifying $\vec{u}$ up to a proportionality constant,
we can set $2\lambda_0 = 1$. Redefining the Lagrange parameters
by multiplying them by $-1$ then gives the result claimed.
QED.
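
For a single constraint the construction reduces to removing from $\nabla V$ its component along $\nabla f$. The sketch below assumes that the direction has the form $\vec{u} \propto \nabla V + \lambda \nabla f$ with $\lambda$ fixed by $\vec{u} \cdot \nabla f = 0$; the sign convention, the function names, and the gradient values are assumptions for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def constrained_direction(grad_V, grad_f):
    """Return (u, lam) with u = grad_V + lam * grad_f and u . grad_f = 0."""
    lam = -dot(grad_V, grad_f) / dot(grad_f, grad_f)
    u = [gv + lam * gf for gv, gf in zip(grad_V, grad_f)]
    return u, lam

# Hypothetical gradient of V, and the gradient of a normalization constraint
# f(q) = sum of the components of q, whose gradient is a vector of ones.
grad_V = [0.3, -1.2, 2.0]
grad_f = [1.0, 1.0, 1.0]

u, lam = constrained_direction(grad_V, grad_f)
```

For the normalization constraint, `u` is `grad_V` with its average subtracted from every component, which is the "value minus average over moves" pattern that appears when the lemma is applied to a single player's distribution.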
F. Proof of claims following Lemma 1



i) Define $f_i(q) = \int dx_i\, q_i(x_i)$, i.e., $f_i$ is the constraint
forcing $q_i$ to be normalized. Now for any q that equals
zero for some joint move there must be an i and an $x'_i$ such
that $q_i(x'_i) = 0$. Plugging into Lemma 1, we can evaluate
the component of the direction of steepest descent along
the direction of player i's probability of making move $x'_i$:

$$\frac{\partial L_i}{\partial q_i(x'_i)} + \lambda \frac{\partial f_i}{\partial q_i(x'_i)} = \beta E(g_i \mid x'_i) + \ln(q_i(x'_i)) - \frac{\int dx''_i\, [\beta E(g_i \mid x''_i) + \ln(q_i(x''_i))]}{\int dx''_i\, 1}.$$

Since there must be some $x''_i$ such that $q_i(x''_i) \neq 0$, $\exists x''_i$ such
that $\beta E(g_i \mid x''_i) + \ln(q_i(x''_i))$ is finite. Therefore our
component is negative infinite. So $L_i$ can be reduced by
increasing $q_i(x'_i)$. Accordingly, no q having zero probability
for some joint move x can be a minimum of i's
Lagrangian.

ii) To construct a bounded rational game with multiple
equilibria, note that at any (necessarily interior) local
minimum q, for each i,

$$\beta E(g_i \mid x_i) + \ln(q_i(x_i)) = \beta \int dx_{(i)}\, g_i(x_i, x_{(i)}) \prod_{j \neq i} q_j(x_j) + \ln(q_i(x_i))$$

must be independent of $x_i$, by Lemma 1. So say
there is a component-by-component bijection $T(x) =
(T_1(x_1), T_2(x_2), \ldots)$ that leaves all the $\{g_j\}$ unchanged,
i.e., such that $g_j(x) = g_j(T(x))\ \forall x, j$ [55].

Define q' by $q'(x) = q(T(x))\ \forall x$. Then for any two
values $x^1_i$ and $x^2_i$,

$$\beta E_{q'}(g_i \mid x^1_i) + \ln(q'_i(x^1_i)) - \beta E_{q'}(g_i \mid x^2_i) - \ln(q'_i(x^2_i))$$
$$= \beta \int dx_{(i)}\, g_i(x^1_i, x_{(i)}) \prod_{j \neq i} q_j(T(x_j)) + \ln(q_i(T(x^1_i))) - \beta \int dx_{(i)}\, g_i(x^2_i, x_{(i)}) \prod_{j \neq i} q_j(T(x_j)) - \ln(q_i(T(x^2_i)))$$
$$= \beta \int dx_{(i)}\, g_i(x^1_i, T^{-1}(x_{(i)})) \prod_{j \neq i} q_j(x_j) + \ln(q_i(T(x^1_i))) - \beta \int dx_{(i)}\, g_i(x^2_i, T^{-1}(x_{(i)})) \prod_{j \neq i} q_j(x_j) - \ln(q_i(T(x^2_i)))$$
$$= \beta \int dx_{(i)}\, g_i(T(x^1_i), x_{(i)}) \prod_{j \neq i} q_j(x_j) + \ln(q_i(T(x^1_i))) - \beta \int dx_{(i)}\, g_i(T(x^2_i), x_{(i)}) \prod_{j \neq i} q_j(x_j) - \ln(q_i(T(x^2_i)))$$
$$= \beta E_q(g_i \mid T(x^1_i)) + \ln(q_i(T(x^1_i))) - \beta E_q(g_i \mid T(x^2_i)) - \ln(q_i(T(x^2_i))),$$

where the invariance of $g_i$ was used in the penultimate
step. Since q is a local minimum though, this last difference
must equal 0. Therefore q' is also a local minimum.

Now choose the game so that $\forall i, x_i$, $T(x_i) \neq x_i$. (Our
congestion game example has this property.) Then the
only way the transformation $q \rightarrow q(T)$ can avoid
producing a new product distribution is if $q_i(x_i) =
q_i(x'_i)\ \forall i, x_i, x'_i$, i.e., q is uniform. Say the Hessians of
the players' Lagrangians are not all positive definite at
the uniform q. (For example have our congestion game
be biased away from uniform multiplicities.) Then that
q is not a local minimum of the Lagrangians. Therefore
at a local minimum, $q \neq q(T)$. Accordingly, q and q(T)
are two distinct equilibria.
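
A toy instance makes the symmetry argument concrete. The sketch assumes a hypothetical two-player, two-move congestion-style game in which each player pays cost 1 when both choose the same move, so that flipping every player's move (the map T) leaves the costs unchanged. It only checks the coupled Boltzmann stationarity conditions, not second-order minimality, and the value of beta and the function names are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def response(p_other, beta):
    """Boltzmann probability of move 0 when the opponent plays move 0 w.p. p_other.

    E(g | move 0) = p_other and E(g | move 1) = 1 - p_other, so
    q(0) = exp(-beta*p_other) / (exp(-beta*p_other) + exp(-beta*(1 - p_other))).
    """
    return sigmoid(beta * (1.0 - 2.0 * p_other))

def solve(p_init, beta, iters=200):
    p = p_init                    # player 0's probability of move 0
    r = response(p, beta)         # player 1's probability of move 0
    for _ in range(iters):
        p = response(r, beta)
        r = response(p, beta)
    return p, r

beta = 6.0                        # hypothetical (fairly high) rationality
p, r = solve(0.9, beta)
p_T, r_T = 1.0 - p, 1.0 - r       # the same distribution after applying T

stationary = max(abs(response(r, beta) - p), abs(response(p, beta) - r))
stationary_T = max(abs(response(r_T, beta) - p_T), abs(response(p_T, beta) - r_T))
```

Both `(p, r)` and its image `(p_T, r_T)` satisfy the stationarity conditions, and they are far apart, since player 0 almost always picks move 0 in one and move 1 in the other.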

iii) To establish that at any q there is always a direction
along which any player's Lagrangian is locally convex, fix
all but two of the $\{q_i\}$, $q_0$ and $q_1$, and fix both $q_0$ and $q_1$
for all but two of their respective possible values, which
we can write as $q_0(0)$, $q_0(1)$, $q_1(0)$, and $q_1(1)$, respectively.
So we can parameterize the set of q we're considering by
two real numbers, $x = q_0(0)$ and $y = q_1(0)$. The 2x2
Hessian of $L_i$ as a function of x and y has the entries

$$\begin{pmatrix} \dfrac{1}{x} + \dfrac{1}{a - x} & \alpha \\ \alpha & \dfrac{1}{y} + \dfrac{1}{b - y} \end{pmatrix}$$

where $a = q_0(0) + q_0(1)$ and $b = q_1(0) + q_1(1)$ are the (fixed)
total probabilities of the two free values, so that $q_0(1) = a - x$
and $q_1(1) = b - y$, and $\alpha$ is a function of $g_i$ and
$\prod_{j \neq 0,1} q_j$. Defining $s = \frac{1}{x} + \frac{1}{a - x}$
and $t = \frac{1}{y} + \frac{1}{b - y}$, the eigenvalues of that Hessian are

$$\frac{s + t \pm \sqrt{4\alpha^2 + (s - t)^2}}{2}.$$

The eigenvalue for the positive root is necessarily positive.
Therefore along the corresponding eigenvector, $L_i$
is convex at q. QED.
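
The eigenvalue claim can be checked directly for a symmetric 2x2 matrix with diagonal entries s and t and off-diagonal entry $\alpha$. The sketch treats a - x and b - y as the positive probabilities of each player's second free value, which is an assumption about the intended meaning of a and b, and all numerical values are hypothetical.

```python
import math

def hessian_eigenvalues(s, t, alpha):
    """Eigenvalues of [[s, alpha], [alpha, t]], returned as (smaller, larger)."""
    root = math.sqrt(4.0 * alpha ** 2 + (s - t) ** 2)
    return (s + t - root) / 2.0, (s + t + root) / 2.0

# Hypothetical values: free probabilities x, y and their fixed totals a, b.
x, a = 0.2, 0.5
y, b = 0.1, 0.6
alpha = -15.0                      # hypothetical strong coupling between the players

s = 1.0 / x + 1.0 / (a - x)
t = 1.0 / y + 1.0 / (b - y)
low, high = hessian_eigenvalues(s, t, alpha)

def char_poly(lam):
    """Characteristic polynomial det([[s, alpha], [alpha, t]] - lam * I)."""
    return lam * lam - (s + t) * lam + (s * t - alpha * alpha)
```

With this coupling `low` is negative, so the Lagrangian is not convex at this q, yet `high` is positive, which is all the claim requires.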